Re: Stable SATA pci card for FreeBSD 6.x/7.0

Sebastiaan van Erk Tue, 05 Aug 2008 10:08:57 -0700

Jeremy Chadwick wrote:

First and foremost, you've forgotten to CC the mailing list on all but
one of your replies.  I'll assume this is intentional, but it's probably
not for the best, as readers may find your post and wonder what the
outcome was.

It was not intentional; I hit reply instead of reply-all. Sorry. I will send this reply to the list so other interested parties can follow the thread and your informative replies.

On Tue, Aug 05, 2008 at 02:47:45PM +0200, Sebastiaan van Erk wrote:

Hi,

Thanks for the reply.

Jeremy Chadwick wrote:
Yes, most of the Silicon Image ICs I've read about have odd driver
problems or general issues (even under Windows).  The system rebooting
is an odd one; you sure your PSU can handle two disks?

Well, I've got a 450W Asus PSU in there, but I've also got 6 hard disks and 1 DVD-ROM drive (mostly inactive). The hard disks are mostly 250/300GB, but the two new ones are 1TB SATA drives. The 450W should easily be enough, shouldn't it?


Without getting into semantics, a 450W PSU may be on the light side for
6 disks.  I'm fairly amazed you're able to power up that machine without
disk errors or other problems during POST.  All 6 disks will spin up
simultaneously -- and spin-up is when disks draw the most power, though
they may also draw a lot during normal operation.

If you have a different (or larger) PSU, I would recommend trying that
to see if it addresses your problem.  A PSU which isn't providing enough
power will cause the disks to occasionally disconnect from the bus, or
the machine to sporadically lock up, reboot (power-cycle), or do other
odd things.
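
The spin-up concern above is simple arithmetic on the 12 V rail. Below is a rough Python sketch of that calculation. Every number in it is a placeholder assumption (the thread gives neither the PSU's 12 V rating nor the drives' spin-up current), and the function name is made up for illustration; substitute the figures from the PSU label and the drive datasheets.

```python
def spinup_headroom_amps(rail_amps, disks, spinup_amps_per_disk, other_load_amps=0.0):
    """Return the 12 V current left over while all disks spin up at once.

    A negative result means the assumed rail rating is exceeded.
    """
    return rail_amps - (disks * spinup_amps_per_disk + other_load_amps)


# Placeholder figures, NOT taken from the thread or from any datasheet:
# an 18 A 12 V rail, 2 A per disk at spin-up, 5 A for CPU/fans/other loads.
headroom = spinup_headroom_amps(rail_amps=18.0, disks=6,
                                spinup_amps_per_disk=2.0, other_load_amps=5.0)
print(f"12 V headroom during simultaneous spin-up: {headroom:+.1f} A")
```

With these made-up inputs the margin is small, which is the kind of situation the paragraph above warns about; real figures may look better or worse.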

Unfortunately I don't have a larger PSU lying around, but I could buy one; though I'd like to try some other stuff first because I've had 6 disks in my PC before without any problems.

[/var/log/messages before the crash]
Aug 5 11:16:14 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=111376236544, length=16384)] error = 6
Aug  5 11:16:17 piglet last message repeated 9 times
Are you sure this is being caused by the controller?  Have you checked
SMART statistics on both disks?  Assuming error == errno, errno 6 is
"Device not configured".
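
For reference, the g_vfs_done line can be picked apart mechanically. This is a minimal Python sketch, assuming (as above) that the trailing number is a plain errno value; the regular expression and the function name are written just for this example and are not part of any FreeBSD tool. Errno 6 is ENXIO, which FreeBSD describes as "Device not configured"; the message text differs on other operating systems, so only the symbolic name is looked up here.

```python
import errno
import re

LOG_LINE = ("Aug 5 11:16:14 piglet kernel: g_vfs_done():mirror/gm1s1e"
            "[WRITE(offset=111376236544, length=16384)] error = 6")

# Hypothetical pattern for this one message format.
PATTERN = re.compile(
    r"g_vfs_done\(\):(?P<provider>[^\[]+)"
    r"\[(?P<op>\w+)\(offset=(?P<offset>\d+), length=(?P<length>\d+)\)\]"
    r"\s*error = (?P<error>\d+)"
)


def parse_g_vfs_done(line):
    """Split a g_vfs_done message into its fields, or return None."""
    m = PATTERN.search(line)
    if m is None:
        return None
    code = int(m.group("error"))
    return {
        "provider": m.group("provider"),
        "op": m.group("op"),
        "offset": int(m.group("offset")),
        "length": int(m.group("length")),
        "errno": code,
        # Symbolic errno name, e.g. ENXIO for 6.
        "symbol": errno.errorcode.get(code, "unknown"),
    }


info = parse_g_vfs_done(LOG_LINE)
print(info)
```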

I did look at the SMART stats [pasted them below]. What I will try next is to swap the two 250GB SATA drives on my main board with the two 1TB drives on the controller and see whether the problems persist when I really increase the load on the two 1TB drives.

More and more information about your system configuration is coming to
light.  Your original post didn't disclose any of that; now I know you
have 6 disks in the system, 2 of which are using on-board SATA (no idea
what controller), and 2 which are using a Silicon Image controller.
What are the remaining 2 disks connected to?

Sorry that I didn't give you that information immediately. The problem when you do that, though, is that the post is sometimes ignored because it is deemed too long or complicated (at least I've seen that happen). I'll gladly post any relevant data.

My other (on-board) SATA controller is a VIA controller; and I've never had any problems with it (although the hardware raid messed up once a year or 2 ago, and since then I've been using software raid without any issues).

[EMAIL PROTECTED]:15:0: class=0x010400 card=0x71421462 chip=0x31491106 rev=0x80 hdr=0x00

    vendor     = 'VIA Technologies Inc'
    device     = 'VT8237  VT6410 SATA RAID Controller'
    class      = mass storage
    subclass   = RAID

The remaining disks are PATA disks on the on-board IDE controller. It's a legacy computer that's been upgraded a lot, though it's not too obsolete; the CPU is an AMD Sempron(tm) Processor 2600+ (1599.83-MHz 686-class CPU).

Your recommended method of troubleshooting (swapping the 250G for the
1TB) is a good idea.  But hear me loud and clear: just because you
switch the disks and the problem disappears for a few hours doesn't mean
it's gone.  There have been **many** people who have shown up on the
mailing lists stating "I did <X thing> and now it works!", only to find
that a week later it *didn't* fix the problem.

Yes, I don't really expect it to solve the problem, but was thinking that at least I could try and stress test the known working disks on the controller and try to see if it's the controller that's the problem or the disks (or something else). I've been able to reproduce the crashes pretty well by just doing a lot of disk IO on the 1TB disks only (so the other disks were pretty idle during the tests).

There's been recent discussion of such messages being caused by the use
of gmirror or gjournal, when the mirror/journal is improperly set up.
(In one user's case, he was receiving similar errors, as well as the
filesystem failing during fsck.  Turns out he incorrectly configured
journalling, which nuked the last ~1MB of his UFS filesystem.)

I'm not saying this is the reason for the messages you see, but it's
something to keep in mind.

I'll try to reconfigure the geom. I used an online tutorial, but I'm not quite sure that I did everything correctly, though fsck worked all right. I did do this one differently than usual, though: usually I initialize one of the disks first and then convert it to a full-disk mirror by using:
sysctl kern.geom.debugflags=16
gmirror label -v -b round-robin gm0 /dev/ad0
gmirror insert gm0 /dev/ad2


I'm not familiar with gmirror or gjournal, so I can't help much here.
Your syntax looks fine based on what's in the gmirror(8) manpage.

However, when I did check the smart stats again, I noticed I'd been smartctling the wrong disk (duh), and smart was not enabled on the new disks. I enabled it now, and it comes with a bunch of warnings and other stuff....


You played with the state of SMART on the disks without providing any of
the output before doing so.  *sigh*  Usually SMART is enabled by system
BIOSes after the disks are powered on, but some systems do not do this.
It also depends on what they're connected to; an external SATA
controller might not do this, but then again it might.

When you received the message from smartctl that SMART wasn't enabled,
what ***exactly*** did it say?  Did it say it was disabled and then not
provide any SMART statistics, or did it say it was disabled and showed
you stats?

If the latter (2nd option), then I wouldn't worry about turning on SMART
using smartctl; I've seen many SCSI disks do this, and the drives keep
track of SMART data regardless.  If the former (1st option), then what
you did is correct.


It was the 1st option. It gave no stats whatsoever.

Considering it wasn't enabled, maybe the errors wouldn't show up anyway, but here's the output of the smartctl command just in case somebody sees something to worry about in it... (The ECC recovery count looks rather high; I tried -F samsung and -F samsung2 but that didn't help.)

If you ran smartctl -a /dev/adXX and it said SMART was disabled, but
still provided statistics, then those statistics should be accurate.

I was about to look at your SMART data, but I noticed you're using
smartmontools 5.37.  Please upgrade to 5.38 (yes, it's important), and
provide the following output in full:

smartctl -a /dev/ad4
smartctl -a /dev/ad6


I'll upgrade and attach the new statistics.

The SMART statistics you provided (for ad4 only) show no sign of
problems with the disk.  But it's apparent from your concern over
Attribute 195 (Hardware_ECC_Recovered) that you don't know how to read
the data being given to you.  :-)


You're absolutely right. :-) I have no idea what any of that stuff means. :-)

There are other things I'd like you to do to the disks, but let's not do
them until you've upgraded smartmontools.

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       153751007


This is perfectly fine.  The raw value stored in the SMART data offset
for index 195 contains a very large value -- this means *absolutely
nothing*.  What *does* matter for this attribute are the values labelled
VALUE, WORST, and THRESH.

There is not an easy way to explain how these numbers are calculated.
The raw value for the attribute is "adjusted", based on formulas that
are specific to the disk manufacturer or disk model.  The end result is
the number you see in VALUE.

WORST is the lowest (worst) VALUE the drive has ever recorded for the
attribute.  A low number is not a problem by itself; what matters is how
close it is to THRESH.

THRESH is the threshold: if VALUE drops to or below that number, the
overall SMART health status will start returning FAIL.

Without getting into extreme details (I can if you want, but it will
take me a very long time to go over it all; my Wiki page will eventually
explain to people how to read SMART statistics properly), a VALUE of 100
with a THRESH of 0 indicates your drive, ECC-recovery-wise, is in
perfect working order.  WORST is 100, which means the worst value it's
ever seen for that attribute is 100, confirming my point.
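
The reading of VALUE, WORST and THRESH can be sketched in a few lines of Python. This is only an illustration, assuming the convention documented in smartctl(8) that an attribute counts as failed when its normalized VALUE is less than or equal to THRESH; the class and function names are invented for this example, and the raw value is deliberately ignored.

```python
from typing import NamedTuple


class SmartAttribute(NamedTuple):
    attr_id: int
    name: str
    value: int
    worst: int
    thresh: int
    raw: str


def parse_attribute(line: str) -> SmartAttribute:
    """Parse one row of the smartctl -a attribute table.

    Columns: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    """
    fields = line.split(None, 9)
    return SmartAttribute(
        attr_id=int(fields[0]),
        name=fields[1],
        value=int(fields[3]),
        worst=int(fields[4]),
        thresh=int(fields[5]),
        raw=fields[9].strip(),
    )


def is_failing(attr: SmartAttribute) -> bool:
    """True if the attribute is at or below its threshold right now."""
    return attr.value <= attr.thresh


def has_ever_failed(attr: SmartAttribute) -> bool:
    """True if the worst recorded value ever reached the threshold."""
    return attr.worst <= attr.thresh


row = ("195 Hardware_ECC_Recovered  0x001a   100   100   000    "
       "Old_age   Always       -       153751007")
attr = parse_attribute(row)
print(attr.name, "failing now:", is_failing(attr),
      "ever failed:", has_ever_failed(attr))
```

For attribute 195 as quoted above, both checks come out false regardless of how large the raw value is.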

One more detail: I'm also getting these messages in the log right now:

ad4: FAILURE - SMART status=51<READY,DSC,ERROR> error=4<ABORTED>
ad6: FAILURE - SMART status=51<READY,DSC,ERROR> error=4<ABORTED>
ad4: FAILURE - SMART status=51<READY,DSC,ERROR> error=4<ABORTED>
ad4: FAILURE - SMART status=51<READY,DSC,ERROR> error=4<ABORTED>
ad6: FAILURE - SMART status=51<READY,DSC,ERROR> error=4<ABORTED>

I have no idea what this means, the only google link I found mentions
something about the drive speed (i.e., I have SATA II drives, but the
controller/cables may be SATA I). Should I somehow limit my drives to
SATA150?


Slow down.  Start by upgrading smartmontools to 5.38; it's possible that
the older version was sending a SMART command to the disks which Samsung
doesn't implement (thus returning an error, of subtype ABORTED).

Version 5.38 may fix that, if that is indeed the problem.  I'm not sure.

Please upgrade to 5.38, then see if the errors in your log correlate
with when you run smartctl -a /dev/adXX.  If they do, then I'm pretty
sure I know what's going on, and we'll need to contact Bruce Allen
(author of smartmontools) to get it fixed.
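
The status and error fields in the ad4/ad6 log lines can be decoded bit by bit. The Python sketch below assumes the two numbers are hexadecimal dumps of the ATA status and error registers; that assumption is at least consistent with the log, since 0x51 decodes to exactly the READY, DSC and ERROR flags the kernel printed. Labels other than READY, DSC, ERROR and ABORTED are picked for this example and may not match the kernel's own names.

```python
import re

# ATA status register bits (labels partly chosen for illustration).
STATUS_BITS = {
    0x80: "BUSY",
    0x40: "READY",
    0x20: "DEVICE_FAULT",
    0x10: "DSC",
    0x08: "DRQ",
    0x04: "CORRECTED",
    0x02: "INDEX",
    0x01: "ERROR",
}

# ATA error register bits (only ABORTED appears in the log above).
ERROR_BITS = {
    0x80: "ICRC",
    0x40: "UNCORRECTABLE",
    0x20: "MEDIA_CHANGED",
    0x10: "ID_NOT_FOUND",
    0x08: "MEDIA_CHANGE_REQUEST",
    0x04: "ABORTED",
    0x02: "NO_MEDIA",
    0x01: "ILLEGAL_LENGTH",
}


def decode(value, table):
    """Return the names of all bits set in value, highest bit first."""
    return [name for bit, name in sorted(table.items(), reverse=True)
            if value & bit]


LINE = "ad4: FAILURE - SMART status=51<READY,DSC,ERROR> error=4<ABORTED>"
m = re.search(r"status=([0-9a-f]+)<[^>]*> error=([0-9a-f]+)<", LINE)
status = int(m.group(1), 16)
error = int(m.group(2), 16)
print("status:", decode(status, STATUS_BITS))
print("error: ", decode(error, ERROR_BITS))
```

ABORTED in the error register means the drive rejected the command it was given, which fits the theory above that the older smartctl sent a SMART command these disks do not implement.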


Got it!

Thanks for the answers, see the output of smartctl below.

Regards,
Sebastiaan

smartctl version 5.38 [i386-portbld-freebsd6.3] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     SAMSUNG HD103UJ
Serial Number:    S13PJ1BQ606865
Firmware Version: 1AA01112
User Capacity:    1,000,204,886,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 3b
Local Time is:    Tue Aug  5 19:06:56 2008 CEST

==> WARNING: May need -F samsung or -F samsung2 enabled; see manual for details.

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                 (11811) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 198) minutes.
Conveyance self-test routine
recommended polling time:        (  21) minutes.
SCT capabilities:              (0x003f) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   253   253   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   090   090   011    Pre-fail  Always       -       4050
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       4
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   253   253   051    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0025   100   100   015    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       234
 10 Spin_Retry_Count        0x0033   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       4
 13 Read_Soft_Error_Rate    0x000e   253   253   000    Old_age   Always       -       0
183 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
184 Unknown_Attribute       0x0033   100   100   099    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   054   054   000    Old_age   Always       -       46 (Lifetime Min/Max 40/46)
194 Temperature_Celsius     0x0022   055   052   000    Old_age   Always       -       45 (Lifetime Min/Max 36/49)
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       153751007
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x000a   100   100   000    Old_age   Always       -       0
201 Soft_Read_Error_Rate    0x000a   253   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 0
Warning: ATA Specification requires self-test log structure revision number = 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective Self-Test Log Data Structure Revision Number (0) should be 1
SMART Selective self-test log data structure revision number 0
Warning: ATA Specification requires selective self-test log data structure revision number = 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

smartctl version 5.38 [i386-portbld-freebsd6.3] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     SAMSUNG HD103UJ
Serial Number:    S13PJ1BQ607102
Firmware Version: 1AA01112
User Capacity:    1,000,204,886,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 3b
Local Time is:    Tue Aug  5 19:07:03 2008 CEST

==> WARNING: May need -F samsung or -F samsung2 enabled; see manual for details.

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                 (12131) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 203) minutes.
Conveyance self-test routine
recommended polling time:        (  22) minutes.
SCT capabilities:              (0x003f) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   090   090   011    Pre-fail  Always       -       3870
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       4
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   253   253   051    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0025   100   100   015    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       234
 10 Spin_Retry_Count        0x0033   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       4
 13 Read_Soft_Error_Rate    0x000e   100   100   000    Old_age   Always       -       0
183 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
184 Unknown_Attribute       0x0033   100   100   099    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   056   056   000    Old_age   Always       -       44 (Lifetime Min/Max 38/44)
194 Temperature_Celsius     0x0022   057   054   000    Old_age   Always       -       43 (Lifetime Min/Max 35/46)
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       196672230
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x000a   253   253   000    Old_age   Always       -       0
201 Soft_Read_Error_Rate    0x000a   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 0
Warning: ATA Specification requires self-test log structure revision number = 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective Self-Test Log Data Structure Revision Number (0) should be 1
SMART Selective self-test log data structure revision number 0
Warning: ATA Specification requires selective self-test log data structure revision number = 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
