Jeremy Chadwick wrote:
First and foremost, you've forgotten to CC the mailing list on all but one of your replies. I'll assume this is intentional, but it's probably not for the best, as readers may find your post and wonder what the outcome was.
It was not intentional; I hit reply instead of reply-all. Sorry. I will send this reply to the list, so other interested parties can follow the thread and your informative replies.
On Tue, Aug 05, 2008 at 02:47:45PM +0200, Sebastiaan van Erk wrote:

Hi, Thanks for the reply.

Jeremy Chadwick wrote:

Yes, most of the Silicon Image ICs I've read about have odd driver problems or general issues (even under Windows). The system rebooting is an odd one; you sure your PSU can handle two disks?

Well, I've got a 450W Asus PSU in there, but I've also got 6 hard disks and 1 dvd-rom drive (mostly inactive) in there. The hard disks are mostly 250/300GB but the two new ones are 1TB SATA drives. But the 450W should easily be enough, shouldn't it?

Without getting into semantics, a 450W PSU may be on the light side for 6 disks. I'm fairly amazed you're able to power up that machine without disk errors or other problems during POST. You'll be having 6 disks spin up all simultaneously -- and spin-up is when disks draw the most power, and possibly during normal operation. If you have a different (or larger) PSU, I would recommend trying that to see if it addresses your problem. A PSU which isn't providing enough power will cause the disks to occasionally disconnect from the bus, or the machine to sporadically lock up, reboot (power-cycle), or do other odd things.
Unfortunately I don't have a larger PSU lying around, but I could buy one; though I'd like to try some other stuff first because I've had 6 disks in my PC before without any problems.
[/var/log/messages before the crash]
Aug 5 11:16:14 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=111376236544, length=16384)] error = 6
Aug 5 11:16:17 piglet last message repeated 9 times

Are you sure this is being caused by the controller? Have you checked SMART statistics on both disks? Assuming error == errno, errno 6 is "Device not configured".

I did look at the smart stats [pasted them below]. What I will try next is just to switch the two 250GB SATA drives on my main board with the two 1TB drives on the controller and see if I still get the problems if I really increase the load on the two 1TB drives.

More and more information about your system configuration is coming to light. Your original post didn't disclose any of that; now I know you have 6 disks in the system, 2 of which are using on-board SATA (no idea what controller), and 2 which are using a Silicon Image controller. What are the remaining 2 disks connected to?
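The g_vfs_done() line quoted above packs the GEOM provider, the operation, the byte offset, the length and the error number into one string. A minimal sketch of pulling those fields apart follows; the regular expression is inferred from that single sample line, and the function name `parse_g_vfs_done` is made up for illustration. The symbolic name comes from Python's `errno` table; the human-readable text for the number varies by platform (FreeBSD words it "Device not configured").

```python
import errno
import re

# Pattern inferred from the one log line in this thread (an assumption,
# not a documented format).
LINE_RE = re.compile(
    r"g_vfs_done\(\):(?P<provider>[^\[]+)"
    r"\[(?P<op>\w+)\(offset=(?P<offset>\d+), length=(?P<length>\d+)\)\]"
    r"\s*error = (?P<error>\d+)"
)


def parse_g_vfs_done(line):
    """Return the fields of a g_vfs_done() log line, or None if it does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    code = int(m.group("error"))
    return {
        "provider": m.group("provider"),
        "op": m.group("op"),
        "offset": int(m.group("offset")),
        "length": int(m.group("length")),
        "errno": code,
        # Symbolic errno name; falls back to "unknown" for unlisted codes.
        "errno_name": errno.errorcode.get(code, "unknown"),
    }


sample = (
    "Aug 5 11:16:14 piglet kernel: g_vfs_done():mirror/gm1s1e"
    "[WRITE(offset=111376236544, length=16384)] error = 6"
)
info = parse_g_vfs_done(sample)
print(info)
```

Grouping such records by provider would show whether the failing writes all land on the same mirror or are spread across devices.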
Sorry that I didn't give you that information immediately. The problem when you do that, though, is that the post is sometimes ignored because it is deemed too long or complicated (at least I've seen that happen). I'll gladly post any relevant data.
My other (on-board) SATA controller is a VIA controller; and I've never had any problems with it (although the hardware raid messed up once a year or 2 ago, and since then I've been using software raid without any issues).
[EMAIL PROTECTED]:15:0: class=0x010400 card=0x71421462 chip=0x31491106 rev=0x80 hdr=0x00
vendor = 'VIA Technologies Inc'
device = 'VT8237 VT6410 SATA RAID Controller'
class = mass storage
subclass = RAID
The remaining disks are PATA disks which are in the on-board IDE
controller. It's a legacy computer that's been upgraded a lot, though
it's not too obsolete, the CPU's an AMD Sempron(tm) Processor 2600+
(1599.83-MHz 686-class CPU).
Your recommended method of troubleshooting (swapping the 250G for the 1TB) is a good idea. But hear me loud and clear: just because you switch the disks and the problem disappears for a few hours doesn't mean it's gone. There have been **many** people who have shown up on the mailing lists stating "I did <X thing> and now it works!", only to find that a week later it *didn't* fix the problem.
Yes, I don't really expect it to solve the problem, but was thinking that at least I could try and stress test the known working disks on the controller and try to see if it's the controller that's the problem or the disks (or something else). I've been able to reproduce the crashes pretty well by just doing a lot of disk IO on the 1TB disks only (so the other disks were pretty idle during the tests).
There's been recent discussion of such messages being caused by the use of gmirror or gjournal, when the mirror/journal is improperly set up. (In one user's case, he was receiving similar errors, as well as the filesystem failing during fsck. Turns out he incorrectly configured journalling, which nuked the last ~1MB of his UFS filesystem.) I'm not saying this is the reason for the messages you see, but it's something to keep in mind.

I'll try reconfiguring the geom. I used an online tutorial, but I'm not quite sure that I did everything correctly, though fsck worked alright. I did do this one differently than usual though; usually I use a full-disk mirror after I have already initialized one of the disks, and then I convert it to a mirror by using:

sysctl kern.geom.debugflags=16
gmirror label -v -b round-robin gm0 /dev/ad0
gmirror insert gm0 /dev/ad2

I'm not familiar with gmirror or gjournal, so I can't help much here. Your syntax looks fine based on what's in the gmirror(8) manpage.

However, when I did check the smart stats again, I noticed I'd been smartctling the wrong disk (duh), and smart was not enabled on the new disks. I enabled it now, and it comes with a bunch of warnings and other stuff....

You played with the state of SMART on the disks without providing any of the output before doing so. *sigh* Usually SMART is enabled by system BIOSes after the disks are powered on, but some systems do not do this. It also depends on what they're connected to; an external SATA controller might not do this, but then again it might.

When you received the message from smartctl that SMART wasn't enabled, what ***exactly*** did it say? Did it say it was disabled and then not provide any SMART statistics, or did it say it was disabled and showed you stats? If the latter (2nd option), then I wouldn't worry about turning on SMART using smartctl; I've seen many SCSI disks do this, and the drives keep track of SMART data regardless. If the former (1st option), then what you did is correct.
It was the 1st option. It gave no stats whatsoever.
Considering it wasn't enabled, maybe the errors wouldn't show up anyway, but here's the output of the smartctl command just in case somebody sees something to worry about in it... (The ECC recovery count looks rather high; I tried -F samsung and -F samsung2 but that didn't help).

If you ran smartctl -a /dev/adXX and it said SMART was disabled, but still provided statistics, then those statistics should be accurate. I was about to look at your SMART data, but I noticed you're using smartmontools 5.37. Please upgrade to 5.38 (yes, it's important), and provide the following output in full:

smartctl -a /dev/ad4
smartctl -a /dev/ad6
I'll upgrade and attach the new statistics.
The SMART statistics you provided (for ad4 only) show no sign of problems with the disk. But it's apparent from your concern over Attribute 195 (Hardware_ECC_Recovered) that you don't know how to read the data being given to you. :-)
You're absolutely right. :-) I have no idea what any of that stuff means. :-)
There are other things I'd like you to do to the disks, but let's not do them until you've upgraded smartmontools.

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always   -           153751007

This is perfectly fine. The raw value stored in the SMART data offset for index 195 contains a very large value -- this means *absolutely nothing*. What *does* matter for this attribute are the values labelled VALUE, WORST, and THRESH.

There is not an easy way to explain how these numbers are calculated. The raw value for the attribute is "adjusted", based on formulas that are specific to the disk manufacturer or disk model. The end result is the number you see in VALUE. WORST is the worst (lowest) value ever seen for VALUE. A low number is not bad by itself; it entirely depends on what THRESH is. THRESH is the threshold: if VALUE drops to or below that number, the attribute is considered failed, which is what causes the overall SMART health status to start returning FAIL.

Without getting into extreme details (I can if you want, but it will take me a very long time to go over it all; my Wiki page will eventually explain to people how to read SMART statistics properly), a VALUE of 100 with a THRESH of 0 indicates your drive, ECC-recovery-wise, is in perfect working order. WORST is 100, which means the worst value it's ever seen for that attribute is 100, confirming my point.

One more detail, I'm also getting these messages in the log right now:

ad4: FAILURE - SMART status=51<READY,DSC,ERROR> error=4<ABORTED>
ad6: FAILURE - SMART status=51<READY,DSC,ERROR> error=4<ABORTED>
ad4: FAILURE - SMART status=51<READY,DSC,ERROR> error=4<ABORTED>
ad4: FAILURE - SMART status=51<READY,DSC,ERROR> error=4<ABORTED>
ad6: FAILURE - SMART status=51<READY,DSC,ERROR> error=4<ABORTED>

I have no idea what this means; the only google link I found mentions something about the drive speed (i.e., I have SATA II drives, but the controller/cables may be SATA I).
Should I somehow limit my drives to SATA150?

Slow down. Start by upgrading smartmontools to 5.38; it's possible that the older version was sending a SMART command to the disks which Samsung doesn't implement (thus returning an error, of subtype ABORTED). Version 5.38 may fix that, if that is indeed the problem. I'm not sure. Please upgrade to 5.38, then see if the errors in your log correlate with when you run smartctl -a /dev/adXX. If they do, then I'm pretty sure I know what's going on, and we'll need to contact Bruce Allen (author of smartmontools) to get it fixed.
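For readers puzzling over the "status=51<READY,DSC,ERROR> error=4<ABORTED>" lines: the numbers are the ATA status and error registers, and the names in angle brackets are the bits that are set. The sketch below decodes them, assuming the numbers are hexadecimal (0x51 is exactly READY, DSC and ERROR, which matches the log). The three status names and ABORTED use the log's own spelling; the remaining bit names are conventional ATA abbreviations added for completeness and are an assumption, as is the helper name `decode_line`.

```python
import re

# (mask, name) pairs, highest bit first. READY, DSC, ERROR and ABORTED are
# spelled as in the log lines of this thread; the other names are the
# conventional ATA abbreviations (an assumption, not taken from the thread).
STATUS_BITS = [
    (0x80, "BSY"), (0x40, "READY"), (0x20, "DF"), (0x10, "DSC"),
    (0x08, "DRQ"), (0x04, "CORR"), (0x02, "IDX"), (0x01, "ERROR"),
]
ERROR_BITS = [
    (0x80, "ICRC"), (0x40, "UNC"), (0x20, "MC"), (0x10, "IDNF"),
    (0x08, "MCR"), (0x04, "ABORTED"), (0x02, "NM"), (0x01, "BIT0"),
]

LOG_RE = re.compile(
    r"(?P<dev>\w+): FAILURE - (?P<cmd>\w+) "
    r"status=(?P<status>[0-9a-fA-F]+)<[^>]*> "
    r"error=(?P<error>[0-9a-fA-F]+)<[^>]*>"
)


def decode_bits(value, table):
    """List the names of all bits from `table` that are set in `value`."""
    return [name for mask, name in table if value & mask]


def decode_line(line):
    """Return (device, command, status bit names, error bit names) or None."""
    m = LOG_RE.search(line)
    if m is None:
        return None
    status = int(m.group("status"), 16)  # assumed hexadecimal
    error = int(m.group("error"), 16)
    return (
        m.group("dev"),
        m.group("cmd"),
        decode_bits(status, STATUS_BITS),
        decode_bits(error, ERROR_BITS),
    )


result = decode_line(
    "ad4: FAILURE - SMART status=51<READY,DSC,ERROR> error=4<ABORTED>"
)
print(result)
```

Read this way, the message says the drive was ready, flagged an error, and aborted the SMART command it was given, which fits the suggestion that the disk rejected a command it does not implement.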
Got it! Thanks for the answers, see the output of smartctl below. Regards, Sebastiaan
smartctl version 5.38 [i386-portbld-freebsd6.3] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     SAMSUNG HD103UJ
Serial Number:    S13PJ1BQ606865
Firmware Version: 1AA01112
User Capacity:    1,000,204,886,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 3b
Local Time is:    Tue Aug 5 19:06:56 2008 CEST

==> WARNING: May need -F samsung or -F samsung2 enabled; see manual for details.

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (11811) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 198) minutes.
Conveyance self-test routine
recommended polling time:        (  21) minutes.
SCT capabilities:              (0x003f) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   253   253   051    Pre-fail  Always   -           0
  3 Spin_Up_Time            0x0007   090   090   011    Pre-fail  Always   -           4050
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always   -           4
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always   -           0
  7 Seek_Error_Rate         0x000f   253   253   051    Pre-fail  Always   -           0
  8 Seek_Time_Performance   0x0025   100   100   015    Pre-fail  Offline  -           0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always   -           234
 10 Spin_Retry_Count        0x0033   100   100   051    Pre-fail  Always   -           0
 11 Calibration_Retry_Count 0x0012   100   100   000    Old_age   Always   -           0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always   -           4
 13 Read_Soft_Error_Rate    0x000e   253   253   000    Old_age   Always   -           0
183 Unknown_Attribute       0x0032   100   100   000    Old_age   Always   -           0
184 Unknown_Attribute       0x0033   100   100   099    Pre-fail  Always   -           0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always   -           0
188 Unknown_Attribute       0x0032   100   100   000    Old_age   Always   -           0
190 Airflow_Temperature_Cel 0x0022   054   054   000    Old_age   Always   -           46 (Lifetime Min/Max 40/46)
194 Temperature_Celsius     0x0022   055   052   000    Old_age   Always   -           45 (Lifetime Min/Max 36/49)
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always   -           153751007
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always   -           0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always   -           0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline  -           0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always   -           0
200 Multi_Zone_Error_Rate   0x000a   100   100   000    Old_age   Always   -           0
201 Soft_Read_Error_Rate    0x000a   253   253   000    Old_age   Always   -           0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 0
Warning: ATA Specification requires self-test log structure revision number = 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective Self-Test Log Data Structure Revision Number (0) should be 1
SMART Selective self-test log data structure revision number 0
Warning: ATA Specification requires selective self-test log data structure revision number = 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
smartctl version 5.38 [i386-portbld-freebsd6.3] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     SAMSUNG HD103UJ
Serial Number:    S13PJ1BQ607102
Firmware Version: 1AA01112
User Capacity:    1,000,204,886,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 3b
Local Time is:    Tue Aug 5 19:07:03 2008 CEST

==> WARNING: May need -F samsung or -F samsung2 enabled; see manual for details.

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (12131) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 203) minutes.
Conveyance self-test routine
recommended polling time:        (  22) minutes.
SCT capabilities:              (0x003f) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always   -           0
  3 Spin_Up_Time            0x0007   090   090   011    Pre-fail  Always   -           3870
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always   -           4
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always   -           0
  7 Seek_Error_Rate         0x000f   253   253   051    Pre-fail  Always   -           0
  8 Seek_Time_Performance   0x0025   100   100   015    Pre-fail  Offline  -           0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always   -           234
 10 Spin_Retry_Count        0x0033   100   100   051    Pre-fail  Always   -           0
 11 Calibration_Retry_Count 0x0012   100   100   000    Old_age   Always   -           0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always   -           4
 13 Read_Soft_Error_Rate    0x000e   100   100   000    Old_age   Always   -           0
183 Unknown_Attribute       0x0032   100   100   000    Old_age   Always   -           0
184 Unknown_Attribute       0x0033   100   100   099    Pre-fail  Always   -           0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always   -           0
188 Unknown_Attribute       0x0032   100   100   000    Old_age   Always   -           0
190 Airflow_Temperature_Cel 0x0022   056   056   000    Old_age   Always   -           44 (Lifetime Min/Max 38/44)
194 Temperature_Celsius     0x0022   057   054   000    Old_age   Always   -           43 (Lifetime Min/Max 35/46)
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always   -           196672230
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always   -           0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always   -           0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline  -           0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always   -           0
200 Multi_Zone_Error_Rate   0x000a   253   253   000    Old_age   Always   -           0
201 Soft_Read_Error_Rate    0x000a   100   100   000    Old_age   Always   -           0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 0
Warning: ATA Specification requires self-test log structure revision number = 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective Self-Test Log Data Structure Revision Number (0) should be 1
SMART Selective self-test log data structure revision number 0
Warning: ATA Specification requires selective self-test log data structure revision number = 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
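As a rough aid for reading the attribute tables above, here is a minimal sketch that compares VALUE and WORST against THRESH instead of looking at RAW_VALUE. It assumes the ten-column row layout shown in the smartctl output in this thread and the rule from smartctl(8) that an attribute has failed when its normalized value is less than or equal to its threshold. The names `Attr`, `parse_attr`, `failing_now` and `failed_in_past` are made up for this example.

```python
from typing import NamedTuple


class Attr(NamedTuple):
    id: int
    name: str
    value: int   # current normalized value
    worst: int   # lowest normalized value recorded so far
    thresh: int  # failure threshold
    kind: str    # "Pre-fail" or "Old_age"
    raw: str     # vendor-specific raw value, kept as text


def parse_attr(row):
    """Parse one attribute row; assumes the ten-column layout shown above."""
    f = row.split(None, 9)  # the raw value may itself contain spaces
    return Attr(int(f[0]), f[1], int(f[3]), int(f[4]), int(f[5]), f[6], f[9])


def failing_now(a):
    """Failed when the normalized value is at or below the threshold."""
    return a.value <= a.thresh


def failed_in_past(a):
    """The worst recorded value reached the threshold at some point."""
    return a.worst <= a.thresh


# Three rows copied from the first smartctl output above.
rows = [
    "195 Hardware_ECC_Recovered 0x001a 100 100 000 Old_age Always - 153751007",
    "184 Unknown_Attribute 0x0033 100 100 099 Pre-fail Always - 0",
    "194 Temperature_Celsius 0x0022 055 052 000 Old_age Always - 45 (Lifetime Min/Max 36/49)",
]
attrs = [parse_attr(r) for r in rows]
for a in attrs:
    print(a.id, a.name, "margin:", a.value - a.thresh,
          "FAILING" if failing_now(a) else "ok")
```

With a THRESH of 0, attribute 195 cannot fail under this rule no matter how large its raw counter grows, which is the point made earlier in the thread about Hardware_ECC_Recovered.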
