Re: g_vfs_done error third part--PLEASE HELP!
Hello Roland and FreeBSD friends,

On Fri, May 16, 2008 at 09:07:18PM +0200, Roland Smith wrote:
On Fri, May 16, 2008 at 02:14:14PM +0200, Willy Offermans wrote:

Filesystem   1K-blocks      Used     Avail Capacity  Mounted on
/dev/ar0s1a   20308398    230438  18453290     1%    /
devfs                1         1         0   100%    /dev
/dev/ar0s1d   21321454   3814482  15801256    19%    /usr
/dev/ar0s1e   50777034   5331686  41383186    11%    /var
/dev/ar0s1f  101554150  18813760  74616058    20%    /home
/dev/ar0s1g  274977824  34564876 218414724    14%    /share

pretty normal I would say.

Yes.

Did you notice any file corruption in the filesystem on ar0s1g?

No, the two disks are brand new and I did not encounter any noticeable file corruption. However, I assume that nowadays bad sectors on a hard disk are handled by the hardware and do not need any user interaction to correct. But maybe I'm totally wrong.

Every ATA disk has spare sectors, and they usually don't report bad blocks until the spares are exhausted. In that case it is prudent to replace the disk.

Unmount the filesystem and run fsck(8) on it. Does it report any errors?

sun# fsck /dev/ar0s1g
** /dev/ar0s1g
** Last Mounted on /share
** Phase 1 - Check Blocks and Sizes
INCORRECT BLOCK COUNT I=34788357 (272 should be 264)
CORRECT? [yn] y
INCORRECT BLOCK COUNT I=34789217 (296 should be 288)
CORRECT? [yn] y
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
FREE BLK COUNT(S) WRONG IN SUPERBLK
SALVAGE? [yn] y
SUMMARY INFORMATION BAD
SALVAGE? [yn] y
BLK(S) MISSING IN BIT MAPS
SALVAGE? [yn] y
182863 files, 17282440 used, 120206472 free (12448 frags, 15024253 blocks, 0.0% fragmentation)

* FILE SYSTEM MARKED CLEAN *
* FILE SYSTEM WAS MODIFIED *

The usual stuff I would say.

Disk corruption is never normal. It can be explained if the machine crashed or was power-cycled before the disks were unmounted, but it can also indicate hardware trouble.

Any hints are very much appreciated.
So I have to conclude that the write error message does make sense and that something seems to be wrong with the disks. The next question is: what can I do about it? Should I return the disks to the shop and ask for new ones?

Install sysutils/smartmontools, and run 'smartctl -A /dev/adX | less', where X are the numbers of the drives in the RAID array. In the output, look at the values for Reallocated_Sector_Ct, Current_Pending_Sector, and Offline_Uncorrectable; the relevant figure is the last number on each line. A small number for Reallocated_Sector_Ct is allowable, but non-zero counts for Current_Pending_Sector or Offline_Uncorrectable mean it's time to get a new disk.

sun# atacontrol status ar0
ar0: ATA RAID1 status: READY
 subdisks:
   0 ad4 ONLINE
   1 ad6 ONLINE

So ad4 and ad6 are the HDs of the array.

sun# smartctl -A /dev/ad6
smartctl version 5.38 [i386-portbld-freebsd7.0] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f 100   100   051    Pre-fail Always  -           3
  3 Spin_Up_Time            0x0007 100   100   015    Pre-fail Always  -           7232
  4 Start_Stop_Count        0x0032 100   100   000    Old_age  Always  -           31
  5 Reallocated_Sector_Ct   0x0033 253   253   010    Pre-fail Always  -           0
  7 Seek_Error_Rate         0x000f 253   253   051    Pre-fail Always  -           0
  8 Seek_Time_Performance   0x0025 253   253   015    Pre-fail Offline -           0
  9 Power_On_Hours          0x0032 100   100   000    Old_age  Always  -           1478
 10 Spin_Retry_Count        0x0033 253   253   051    Pre-fail Always  -           0
 11 Calibration_Retry_Count 0x0012 253   253   000    Old_age  Always  -           0
 12 Power_Cycle_Count       0x0032 100   100   000    Old_age  Always  -           31
 13 Read_Soft_Error_Rate    0x000e 100   100   000    Old_age  Always  -           439070649
187 Reported_Uncorrect      0x0032 253   253   000    Old_age  Always  -           0
188 Unknown_Attribute       0x0032 253   253   000    Old_age  Always  -           0
190 Airflow_Temperature_Cel 0x0022 062   060   000    Old_age  Always  -           38
194 Temperature_Celsius     0x0022 124   115   000    Old_age  Always  -           38
195 Hardware_ECC_Recovered  0x001a 100   100   000    Old_age  Always  -           439070649
196 Reallocated_Event_Count 0x0032 253   253   000    Old_age  Always  -           0
197 Current_Pending_Sector  0x0012 253   253   000
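Roland's check (inspect the last column of the critical attributes) can be scripted. This is a sketch, assuming the usual `smartctl -A` table layout with the attribute name in the second field and the raw value in the last field; the cutoff of 10 for Reallocated_Sector_Ct is an illustrative guess at "a small number", not a smartmontools rule:

```shell
# Sketch: flag a disk as suspect from `smartctl -A` output.
# Assumes standard table layout: attribute name in field 2, raw value last.
check_smart() {
  # $1 = captured output of `smartctl -A /dev/adX`
  echo "$1" | awk '
    $2 == "Reallocated_Sector_Ct"  && $NF+0 > 10 { bad=1 }  # cutoff 10 is an assumption
    $2 == "Current_Pending_Sector" && $NF+0 > 0  { bad=1 }
    $2 == "Offline_Uncorrectable"  && $NF+0 > 0  { bad=1 }
    END { print (bad ? "REPLACE" : "OK") }'
}
```

For the ad6 output above (all three counters zero or absent) this prints "OK", matching Jeremy's later remark that the SMART statistics look okay.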
Re: g_vfs_done error third part--PLEASE HELP!
On Sat, May 17, 2008 at 09:52:23AM +0200, Willy Offermans wrote:

sun# atacontrol status ar0
ar0: ATA RAID1 status: READY
 subdisks:
   0 ad4 ONLINE
   1 ad6 ONLINE

What ataraid(4) method are you using? Promise FastTrak? Adaptec HostRAID? Intel MatrixRAID? Please let us know, as there are some known long-standing bugs with ataraid(4) that could (no guarantee) explain what's going on.

So ad4 and ad6 are the HDs of the array.

sun# smartctl -A /dev/ad6

This output excludes the brand/model of the hard disks you have. Can you please tell us this? Different hard disk manufacturers do different things with SMART statistics. Your SMART statistics look okay, but depending upon what drive model and manufacturer is being used, they could be indicative of a problem.

-- 
| Jeremy Chadwick                                 jdc at parodius.com |
| Parodius Networking                        http://www.parodius.com/ |
| UNIX Systems Administrator                   Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: g_vfs_done error third part--PLEASE HELP!
Hello Jeremy and FreeBSD friends,

On Sat, May 17, 2008 at 03:16:27AM -0700, Jeremy Chadwick wrote:

What ataraid(4) method are you using? Promise FastTrak? Adaptec HostRAID? Intel MatrixRAID? Please let us know, as there are some known long-standing bugs with ataraid(4) that could (no guarantee) explain what's going on.
...
This output excludes the brand/model of the hard disks you have. Can you please tell us this? Different hard disk manufacturers do different things with SMART statistics. Your SMART statistics look okay, but depending upon what drive model and manufacturer is being used, they could be indicative of a problem.

From /var/run/dmesg.boot:

ar0: 476837MB Promise Fasttrak RAID1 status: READY
ad4: 476940MB SAMSUNG HD501LJ CR100-12 at ata2-master SATA150
ad6: 476940MB SAMSUNG HD501LJ CR100-12 at ata3-master SATA150

I hope this is the information you are asking for.

-- 
Met vriendelijke groeten,
With kind regards,
Mit freundlichen Gruessen,
De jrus wah,

Willy

W.K. Offermans
Home:   +31 45 544 49 44
Mobile: +31 653 27 16 23
e-mail: [EMAIL PROTECTED]

Powered by FreeBSD (www.FreeBSD.org)
Re: g_vfs_done error third part--PLEASE HELP!
On Fri, May 16, 2008 at 02:14:14PM +0200, Willy Offermans wrote:
On Mon, Apr 21, 2008 at 10:10:47PM +0200, Roland Smith wrote:

Did you notice any file corruption in the filesystem on ar0s1g?

No, the two disks are brand new and I did not encounter any noticeable file corruption. However, I assume that nowadays bad sectors on a hard disk are handled by the hardware and do not need any user interaction to correct. But maybe I'm totally wrong.

You're right, but it depends on the type of disk. SCSI disks will report bad blocks to the OS regardless of whether the disk is about to mark the block as a grown defect or not. PATA and SATA disks, on the other hand, will report bad blocks to the OS only if the internal bad block list (which the disk manages itself; this is what you're thinking of) is full.

There are still many conditions where PATA and SATA disks can induce errors in the OS. If the disk is attempting to work around a bad block and there's a physical error (servo problem, head crash, repetitive re-reads of the block due to dust, whatever; something that ties up the disk for long periods of time), you may see ATA subsystem timeouts, DMA errors, or whatever else. SMART stats will show this kind of problem.

Unmount the filesystem and run fsck(8) on it. Does it report any errors?

sun# fsck /dev/ar0s1g
** /dev/ar0s1g
** Last Mounted on /share
** Phase 1 - Check Blocks and Sizes
INCORRECT BLOCK COUNT I=34788357 (272 should be 264)
CORRECT? [yn] y
INCORRECT BLOCK COUNT I=34789217 (296 should be 288)
CORRECT? [yn] y
...
FREE BLK COUNT(S) WRONG IN SUPERBLK
SALVAGE? [yn] y
...
* FILE SYSTEM MARKED CLEAN *
* FILE SYSTEM WAS MODIFIED *

The usual stuff I would say.

How is this usual? It appears to me you did have some filesystem corruption.

-- 
| Jeremy Chadwick                                 jdc at parodius.com |
| Parodius Networking                        http://www.parodius.com/ |
| UNIX Systems Administrator                   Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |
Re: g_vfs_done error third part--PLEASE HELP!
Willy Offermans wrote:

Hello Roland and FreeBSD friends,

I'm sorry to have been so quiet for a while, but I went away on vacation. Now that I'm back, I'd like to solve this issue.

On Mon, Apr 21, 2008 at 10:10:47PM +0200, Roland Smith wrote:
On Mon, Apr 21, 2008 at 09:04:03PM +0200, Willy Offermans wrote:

Dear FreeBSD friends,

It is already the third time that I report this error. Can someone help me in solving this issue?

Probably the reason that you hear so little is that you provide so little information. Most of us are not clairvoyant.

Over and over again, and always after heavy disk I/O, I see the following errors in the log files. If I force ar0s1g to unmount, the machine spontaneously reboots. Nothing serious seems to be damaged by this, but anyway I cannot afford something bad happening to this production machine.

Why would you force an unmount?

Otherwise the device keeps reporting that it is unavailable and cannot be unmounted:

sun# umount /share/
umount: unmount of /share failed: Resource temporarily unavailable

Apr 18 20:02:19 sun kernel: g_vfs_done():ar0s1g[WRITE(offset=290725068800, length=4096)]error = 5

I have no clue what the errors mean, since offsets of 290725068800, 290725072896, and 290725074944 seem to be ridiculous. Does anybody have a clue what is going on?

For starters, how big is ar0s1g? If the offset is in bytes, it is around 270 GB, which is not that unusual in this day and age.

I have to admit that I was a bit confused by an offset value of 290725068800. There is no indication of a unit, so I assumed it was sectors, but probably it is simply bytes, and then the number does indeed make sense.

I'm using FreeBSD 7.0, but found the error being reported before with previous versions of FreeBSD. I can and will provide more details on demand.

What does 'df' say?
Filesystem   1K-blocks      Used     Avail Capacity  Mounted on
/dev/ar0s1a   20308398    230438  18453290     1%    /
devfs                1         1         0   100%    /dev
/dev/ar0s1d   21321454   3814482  15801256    19%    /usr
/dev/ar0s1e   50777034   5331686  41383186    11%    /var
/dev/ar0s1f  101554150  18813760  74616058    20%    /home
/dev/ar0s1g  274977824  34564876 218414724    14%    /share

pretty normal I would say.

Did you notice any file corruption in the filesystem on ar0s1g?

No, the two disks are brand new and I did not encounter any noticeable file corruption. However, I assume that nowadays bad sectors on a hard disk are handled by the hardware and do not need any user interaction to correct. But maybe I'm totally wrong.

Unmount the filesystem and run fsck(8) on it. Does it report any errors?

sun# fsck /dev/ar0s1g
** /dev/ar0s1g
** Last Mounted on /share
** Phase 1 - Check Blocks and Sizes
INCORRECT BLOCK COUNT I=34788357 (272 should be 264)
CORRECT? [yn] y
INCORRECT BLOCK COUNT I=34789217 (296 should be 288)
CORRECT? [yn] y
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
FREE BLK COUNT(S) WRONG IN SUPERBLK
SALVAGE? [yn] y
SUMMARY INFORMATION BAD
SALVAGE? [yn] y
BLK(S) MISSING IN BIT MAPS
SALVAGE? [yn] y
182863 files, 17282440 used, 120206472 free (12448 frags, 15024253 blocks, 0.0% fragmentation)

* FILE SYSTEM MARKED CLEAN *
* FILE SYSTEM WAS MODIFIED *

The usual stuff I would say.

No, any form of filesystem corruption is not usual.

Any hints are very much appreciated.

Did you manage to create a partition larger than the disk (using newfs's -s switch)? In that case it could be that you're trying to write past the end of the device.
No, look at the following output:

sun# bsdlabel -A /dev/ar0s1
# /dev/ar0s1:
type: unknown
disk: amnesiac
label: 
flags:
bytes/sector: 512
sectors/track: 63
tracks/cylinder: 255
sectors/cylinder: 16065
cylinders: 60799
sectors/unit: 976751937
rpm: 3600
interleave: 1
trackskew: 0
cylinderskew: 0
headswitch: 0           # milliseconds
track-to-track seek: 0  # milliseconds
drivedata: 0 

8 partitions:
#          size     offset    fstype   [fsize bsize bps/cpg]
  a:   41943040          0    4.2BSD        0     0     0
  b:    8388608   41943040      swap
  c:  976751937          0    unused        0     0     # raw part, don't edit
  d:   44040192   50331648    4.2BSD     2048 16384 28552
  e:  104857600   94371840    4.2BSD     2048 16384 28552
  f:  209715200  199229440    4.2BSD     2048 16384 28552
  g:  567807297  408944640    4.2BSD     2048 16384 28552

/dev/ar0s1g starts after 408944640*512/1024/1024 = 199680 MB.

So I have to conclude that the write error message does make sense and that something seems to be wrong with the disks. The next question is: what can I do about it? Should I return the disks to the shop and ask for new ones?

#define EIO 5 /* Input/output error */

At least one of your disks is toast.

Kris
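Willy's 199680 MB figure can be reproduced directly from the bsdlabel output. A quick arithmetic sketch (512-byte sectors, as the label states), which also confirms that the g partition runs exactly to the end of the unit:

```shell
# Reproduce the partition-start calculation from the bsdlabel output:
# the g partition begins at sector 408944640 and is 567807297 sectors long.
offset_sectors=408944640
size_sectors=567807297
start_mb=$((offset_sectors * 512 / 1024 / 1024))
end_sector=$((offset_sectors + size_sectors))
echo "ar0s1g starts at ${start_mb} MB"
echo "ar0s1g ends at sector ${end_sector}"   # sectors/unit in the label: 976751937
```

The end sector equals the label's sectors/unit figure, so the partition does not extend past the device as Kris hypothesized.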
Re: g_vfs_done error third part--PLEASE HELP!
Hello Jeremy and FreeBSD friends,

On Fri, May 16, 2008 at 05:27:59AM -0700, Jeremy Chadwick wrote:
On Fri, May 16, 2008 at 02:14:14PM +0200, Willy Offermans wrote:
On Mon, Apr 21, 2008 at 10:10:47PM +0200, Roland Smith wrote:

Did you notice any file corruption in the filesystem on ar0s1g?

No, the two disks are brand new and I did not encounter any noticeable file corruption. However, I assume that nowadays bad sectors on a hard disk are handled by the hardware and do not need any user interaction to correct. But maybe I'm totally wrong.

You're right, but it depends on the type of disk. SCSI disks will report bad blocks to the OS regardless of whether the disk is about to mark the block as a grown defect or not. PATA and SATA disks, on the other hand, will report bad blocks to the OS only if the internal bad block list (which the disk manages itself; this is what you're thinking of) is full.

There are still many conditions where PATA and SATA disks can induce errors in the OS. If the disk is attempting to work around a bad block and there's a physical error (servo problem, head crash, repetitive re-reads of the block due to dust, whatever; something that ties up the disk for long periods of time), you may see ATA subsystem timeouts, DMA errors, or whatever else. SMART stats will show this kind of problem.

Unmount the filesystem and run fsck(8) on it. Does it report any errors?

sun# fsck /dev/ar0s1g
** /dev/ar0s1g
** Last Mounted on /share
** Phase 1 - Check Blocks and Sizes
INCORRECT BLOCK COUNT I=34788357 (272 should be 264)
CORRECT? [yn] y
INCORRECT BLOCK COUNT I=34789217 (296 should be 288)
CORRECT? [yn] y
...
FREE BLK COUNT(S) WRONG IN SUPERBLK
SALVAGE? [yn] y
...
* FILE SYSTEM MARKED CLEAN *
* FILE SYSTEM WAS MODIFIED *

The usual stuff I would say.

How is this usual? It appears to me you did have some filesystem corruption.

What kind of filesystem corruption, and how do I solve it? I see these messages frequently if a FreeBSD machine unexpectedly reboots. Not only on this system but also on others. I never worried about it.

-- 
Met vriendelijke groeten,
With kind regards,
Mit freundlichen Gruessen,
De jrus wah,

Willy

W.K. Offermans
Home:   +31 45 544 49 44
Mobile: +31 653 27 16 23
e-mail: [EMAIL PROTECTED]

Powered by FreeBSD (www.FreeBSD.org)
Re: g_vfs_done error third part--PLEASE HELP!
Hello Kris,

On Fri, May 16, 2008 at 02:43:24PM +0200, Kris Kennaway wrote:
Willy Offermans wrote:

Hello Roland and FreeBSD friends,

I'm sorry to have been so quiet for a while, but I went away on vacation. Now that I'm back, I'd like to solve this issue.

On Mon, Apr 21, 2008 at 10:10:47PM +0200, Roland Smith wrote:
On Mon, Apr 21, 2008 at 09:04:03PM +0200, Willy Offermans wrote:

Dear FreeBSD friends,

It is already the third time that I report this error. Can someone help me in solving this issue?

Probably the reason that you hear so little is that you provide so little information. Most of us are not clairvoyant.

Over and over again, and always after heavy disk I/O, I see the following errors in the log files. If I force ar0s1g to unmount, the machine spontaneously reboots. Nothing serious seems to be damaged by this, but anyway I cannot afford something bad happening to this production machine.

Why would you force an unmount?

Otherwise the device keeps reporting that it is unavailable and cannot be unmounted:

sun# umount /share/
umount: unmount of /share failed: Resource temporarily unavailable

Apr 18 20:02:19 sun kernel: g_vfs_done():ar0s1g[WRITE(offset=290725068800, length=4096)]error = 5

I have no clue what the errors mean, since offsets of 290725068800, 290725072896, and 290725074944 seem to be ridiculous. Does anybody have a clue what is going on?

For starters, how big is ar0s1g? If the offset is in bytes, it is around 270 GB, which is not that unusual in this day and age.

I have to admit that I was a bit confused by an offset value of 290725068800. There is no indication of a unit, so I assumed it was sectors, but probably it is simply bytes, and then the number does indeed make sense.

I'm using FreeBSD 7.0, but found the error being reported before with previous versions of FreeBSD. I can and will provide more details on demand.

What does 'df' say?

Filesystem   1K-blocks      Used     Avail Capacity  Mounted on
/dev/ar0s1a   20308398    230438  18453290     1%    /
devfs                1         1         0   100%    /dev
/dev/ar0s1d   21321454   3814482  15801256    19%    /usr
/dev/ar0s1e   50777034   5331686  41383186    11%    /var
/dev/ar0s1f  101554150  18813760  74616058    20%    /home
/dev/ar0s1g  274977824  34564876 218414724    14%    /share

pretty normal I would say.

Did you notice any file corruption in the filesystem on ar0s1g?

No, the two disks are brand new and I did not encounter any noticeable file corruption. However, I assume that nowadays bad sectors on a hard disk are handled by the hardware and do not need any user interaction to correct. But maybe I'm totally wrong.

Unmount the filesystem and run fsck(8) on it. Does it report any errors?

sun# fsck /dev/ar0s1g
** /dev/ar0s1g
** Last Mounted on /share
** Phase 1 - Check Blocks and Sizes
INCORRECT BLOCK COUNT I=34788357 (272 should be 264)
CORRECT? [yn] y
INCORRECT BLOCK COUNT I=34789217 (296 should be 288)
CORRECT? [yn] y
...
FREE BLK COUNT(S) WRONG IN SUPERBLK
SALVAGE? [yn] y
...
* FILE SYSTEM MARKED CLEAN *
* FILE SYSTEM WAS MODIFIED *

The usual stuff I would say.

No, any form of filesystem corruption is not usual.

Any hints are very much appreciated.

Did you manage to create a partition larger than the disk (using newfs's -s switch)? In that case it could be that you're trying to write past the end of the device.

No, look at the following output:

sun# bsdlabel -A /dev/ar0s1
# /dev/ar0s1:
...
sectors/unit: 976751937
...
8 partitions:
#          size     offset    fstype   [fsize bsize bps/cpg]
  a:   41943040          0    4.2BSD        0     0     0
  b:    8388608   41943040      swap
  c:  976751937          0    unused        0     0     # raw part, don't edit
  d:   44040192   50331648    4.2BSD     2048 16384 28552
  e:  104857600   94371840    4.2BSD     2048 16384 28552
  f:  209715200  199229440    4.2BSD     2048 16384 28552
  g:  567807297  408944640    4.2BSD     2048 16384 28552

/dev/ar0s1g starts after 408944640*512/1024/1024 = 199680 MB.

So I have to conclude that the write error message does make sense and that something seems to be wrong with the disks. The next question is: what can I do about it? Should I return the disks to the shop and ask for new ones?

#define
Re: g_vfs_done error third part--PLEASE HELP!
On Fri, May 16, 2008 at 05:37:56PM +0200, Willy Offermans wrote:

sun# fsck /dev/ar0s1g
** /dev/ar0s1g
** Last Mounted on /share
** Phase 1 - Check Blocks and Sizes
INCORRECT BLOCK COUNT I=34788357 (272 should be 264)
CORRECT? [yn] y
INCORRECT BLOCK COUNT I=34789217 (296 should be 288)
CORRECT? [yn] y
...
FREE BLK COUNT(S) WRONG IN SUPERBLK
SALVAGE? [yn] y
...
* FILE SYSTEM MARKED CLEAN *
* FILE SYSTEM WAS MODIFIED *

The usual stuff I would say.

How is this usual? It appears to me you did have some filesystem corruption.

What kind of filesystem corruption, and how do I solve it?

That's difficult to answer, for a lot of reasons. Your original post stated that you were seeing g_vfs_done errors on the console, and you were worried about what they implied. Then someone asked "have you fsck'd the filesystem?", and you hadn't. Then you did fsck it, and as can be seen above, the filesystem had errors. When combined with your comment below, it's very difficult to figure out what's going on with your system over there, or what information you're not disclosing. Additionally, kris@ has stated that it looks like you may have a hard disk that's gone bad, and that's a strong possibility as well. SMART statistics of the drives in your RAID array would be useful.

I see these messages frequently if a FreeBSD machine unexpectedly reboots. Not only on this system but also on others. I never worried about it.

Are you saying the above errors were caused by an unexpected crash or reboot? If so, the filesystem should have been automatically fsck'd shortly (60-120 seconds) after getting a "login:" prompt on the console. Is your filesystem UFS2 with softupdates enabled? If so, and the automatic fsck didn't happen, then that's something separate to look into; it should happen automatically with softupdates enabled.

More importantly, though, we would need an explanation for why your system is crashing/rebooting/power-cycling. Data corruption can happen in those situations, especially the latter, but any form of non-clean shutdown should induce a fsck on UFS2+softupdates filesystems.

-- 
| Jeremy Chadwick                                 jdc at parodius.com |
| Parodius Networking                        http://www.parodius.com/ |
| UNIX Systems Administrator                   Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |
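The automatic check Jeremy describes is governed by rc.conf(5) knobs. A sketch of the relevant settings on a FreeBSD 7.x system; the values shown are the stock defaults and are illustrative, not taken from Willy's machine:

```shell
# /etc/rc.conf knobs controlling boot-time fsck on FreeBSD 7.x
# (values shown are the usual defaults; check rc.conf(5) for your release).
background_fsck="YES"       # run fsck on UFS2+softupdates filesystems in the background
background_fsck_delay="60"  # seconds after boot before the background fsck starts
fsck_y_enable="NO"          # if YES, run "fsck -y" at boot when preen mode fails
```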
Re: g_vfs_done error third part--PLEASE HELP!
On Fri, May 16, 2008 at 02:14:14PM +0200, Willy Offermans wrote:

Filesystem   1K-blocks      Used     Avail Capacity  Mounted on
/dev/ar0s1a   20308398    230438  18453290     1%    /
devfs                1         1         0   100%    /dev
/dev/ar0s1d   21321454   3814482  15801256    19%    /usr
/dev/ar0s1e   50777034   5331686  41383186    11%    /var
/dev/ar0s1f  101554150  18813760  74616058    20%    /home
/dev/ar0s1g  274977824  34564876 218414724    14%    /share

pretty normal I would say.

Yes.

Did you notice any file corruption in the filesystem on ar0s1g?

No, the two disks are brand new and I did not encounter any noticeable file corruption. However, I assume that nowadays bad sectors on a hard disk are handled by the hardware and do not need any user interaction to correct. But maybe I'm totally wrong.

Every ATA disk has spare sectors, and they usually don't report bad blocks until the spares are exhausted. In that case it is prudent to replace the disk.

Unmount the filesystem and run fsck(8) on it. Does it report any errors?

sun# fsck /dev/ar0s1g
** /dev/ar0s1g
** Last Mounted on /share
** Phase 1 - Check Blocks and Sizes
INCORRECT BLOCK COUNT I=34788357 (272 should be 264)
CORRECT? [yn] y
INCORRECT BLOCK COUNT I=34789217 (296 should be 288)
CORRECT? [yn] y
...
FREE BLK COUNT(S) WRONG IN SUPERBLK
SALVAGE? [yn] y
...
* FILE SYSTEM MARKED CLEAN *
* FILE SYSTEM WAS MODIFIED *

The usual stuff I would say.

Disk corruption is never normal. It can be explained if the machine crashed or was power-cycled before the disks were unmounted, but it can also indicate hardware trouble.

Any hints are very much appreciated.

So I have to conclude that the write error message does make sense and that something seems to be wrong with the disks. The next question is: what can I do about it? Should I return the disks to the shop and ask for new ones?

Install sysutils/smartmontools, and run 'smartctl -A /dev/adX | less', where X are the numbers of the drives in the RAID array. In the output, look at the values for Reallocated_Sector_Ct, Current_Pending_Sector, and Offline_Uncorrectable; the relevant figure is the last number on each line. A small number for Reallocated_Sector_Ct is allowable, but non-zero counts for Current_Pending_Sector or Offline_Uncorrectable mean it's time to get a new disk.

However, other people that I have contacted, who had a similar problem before, have solved it by using a software RAID setup instead of a hardware RAID setup. This seems to indicate that there is some bug in the FreeBSD code.

The RAID support that you find on most desktop motherboards _is_ software RAID. See ataraid(4).

Roland
-- 
R.F.Smith  http://www.xs4all.nl/~rsmith/
[plain text _non-HTML_ PGP/GnuPG encrypted/signed email much appreciated]
pgp: 1A2B 477F 9970 BA3C 2914 B7CE 1277 EFB0 C321 A725 (KeyID: C321A725)
Re: g_vfs_done error third part--PLEASE HELP!
On Fri, Apr 25, 2008 at 07:59:36AM +0300, Toomas Aas wrote:
Willy Offermans wrote:

It is already the third time that I report this error. Can someone help me in solving this issue?

Apr 21 19:44:36 sun kernel: g_vfs_done():ar0s1g[WRITE(offset=290725074944, length=2048)]error = 5
Apr 21 19:45:07 sun kernel: g_vfs_done():ar0s1g[WRITE(offset=290725074944, length=2048)]error = 5
Apr 21 19:45:38 sun kernel: g_vfs_done():ar0s1g[WRITE(offset=290725074944, length=2048)]error = 5
...

I can only tell you that I had similar problems with FreeBSD 6.3 and ICH7R-based RAID. Since I couldn't figure out how to solve them, I discarded the BIOS-based RAID and instead set up gmirror. It's been running this way for a year now and has been rock solid.

Are you referring to Intel MatrixRAID? If so, there are multiple PRs open on problems with FreeBSD and MatrixRAID, some of which have been open for over 2 years and include patches. You wouldn't be the first person to ask why they haven't been committed to the tree.

-- 
| Jeremy Chadwick                                 jdc at parodius.com |
| Parodius Networking                        http://www.parodius.com/ |
| UNIX Systems Administrator                   Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |
Re: g_vfs_done error third part--PLEASE HELP!
Jeremy Chadwick wrote:
On Fri, Apr 25, 2008 at 07:59:36AM +0300, Toomas Aas wrote:

I can only tell you that I had similar problems with FreeBSD 6.3 and ICH7R-based RAID. Since I couldn't figure out how to solve them, I discarded the BIOS-based RAID and instead set up gmirror. It's been running this way for a year now and has been rock solid.

Are you referring to Intel MatrixRAID?

Yes.

If so, there are multiple PRs open on problems with FreeBSD and MatrixRAID, some of which have been open for over 2 years and include patches.

Funny that I didn't find them when I was investigating the problem. Not that I'm doubting your word, just... funny.

You wouldn't be the first person to ask why they haven't been committed to the tree.

Well, unfortunately I am not competent to comment on that, nor am I in a position to *demand* that something be committed in a volunteer project, since I couldn't even imagine what the consequences would be :) At least I found a workaround.

-- 
Toomas Aas
Re: g_vfs_done error third part--PLEASE HELP!
Willy Offermans wrote:

It is already the third time that I report this error. Can someone help me in solving this issue?

Apr 21 19:44:36 sun kernel: g_vfs_done():ar0s1g[WRITE(offset=290725074944, length=2048)]error = 5
Apr 21 19:45:07 sun kernel: g_vfs_done():ar0s1g[WRITE(offset=290725074944, length=2048)]error = 5
Apr 21 19:45:38 sun kernel: g_vfs_done():ar0s1g[WRITE(offset=290725074944, length=2048)]error = 5
...

I can only tell you that I had similar problems with FreeBSD 6.3 and ICH7R-based RAID. Since I couldn't figure out how to solve them, I discarded the BIOS-based RAID and instead set up gmirror. It's been running this way for a year now and has been rock solid.

-- 
Toomas Aas
... One way to be happy ever after is not to be after too much.
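The gmirror workaround Toomas describes can be sketched as follows. The device names (ad4/ad6) and the mirror name (gm0) are illustrative assumptions matching Willy's setup, not Toomas's actual commands; this is the standard gmirror(8) procedure, and the existing ar0 array would have to be destroyed (and the data backed up) first:

```shell
# Sketch: replace a BIOS ataraid(4) mirror with a gmirror(8) software mirror.
# ASSUMED device names; destroys existing data on the second disk. Back up first.
gmirror label -v -b round-robin gm0 /dev/ad4   # create the mirror from one disk
gmirror insert gm0 /dev/ad6                    # attach the second disk; it resynchronizes
echo 'geom_mirror_load="YES"' >> /boot/loader.conf
# finally, point /etc/fstab at /dev/mirror/gm0s1* instead of /dev/ar0s1*
```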
g_vfs_done error third part--PLEASE HELP!
Dear FreeBSD friends,

It is already the third time that I report this error. Can someone help me in solving this issue?

Over and over again, and always after heavy disk I/O, I see the following errors in the log files. If I force ar0s1g to unmount, the machine spontaneously reboots. Nothing serious seems to be damaged by this, but anyway I cannot afford something bad happening to this production machine.

Currently the error is the following:

snip
...
Apr 21 19:44:36 sun kernel: g_vfs_done():ar0s1g[WRITE(offset=290725074944, length=2048)]error = 5
Apr 21 19:45:07 sun kernel: g_vfs_done():ar0s1g[WRITE(offset=290725074944, length=2048)]error = 5
Apr 21 19:45:38 sun kernel: g_vfs_done():ar0s1g[WRITE(offset=290725074944, length=2048)]error = 5
...
/snip

Before, the error appeared like:

snip
...
Apr 18 20:00:15 sun kernel: g_vfs_done():ar0s1g[WRITE(offset=290725072896, length=2048)]error = 5
Apr 18 20:00:46 sun kernel: g_vfs_done():ar0s1g[WRITE(offset=290725068800, length=4096)]error = 5
Apr 18 20:00:46 sun kernel: g_vfs_done():ar0s1g[WRITE(offset=290725072896, length=2048)]error = 5
Apr 18 20:01:17 sun kernel: g_vfs_done():ar0s1g[WRITE(offset=290725068800, length=4096)]error = 5
Apr 18 20:01:17 sun kernel: g_vfs_done():ar0s1g[WRITE(offset=290725072896, length=2048)]error = 5
Apr 18 20:01:48 sun kernel: g_vfs_done():ar0s1g[WRITE(offset=290725068800, length=4096)]error = 5
Apr 18 20:01:48 sun kernel: g_vfs_done():ar0s1g[WRITE(offset=290725072896, length=2048)]error = 5
Apr 18 20:02:19 sun kernel: g_vfs_done():ar0s1g[WRITE(offset=290725068800, length=4096)]error = 5
...
/snip

I have no clue what the errors mean, since offsets of 290725068800, 290725072896, and 290725074944 seem to be ridiculous. Does anybody have a clue what is going on?

I'm using FreeBSD 7.0, but found the error being reported before with previous versions of FreeBSD. I can and will provide more details on demand.

Any hints are very much appreciated.
--
Met vriendelijke groeten,
With kind regards,
Mit freundlichen Gruessen,
De jrus wah,

Willy

* W.K. Offermans
Home:   +31 45 544 49 44
Mobile: +31 653 27 16 23
e-mail: [EMAIL PROTECTED]

Powered by FreeBSD  www.FreeBSD.org
Re: g_vfs_done error third part--PLEASE HELP!
On Mon, Apr 21, 2008 at 09:04:03PM +0200, Willy Offermans wrote:
> Dear FreeBSD friends,
>
> It is already the third time that I report this error. Can someone help
> me in solving this issue?

Probably the reason that you hear so little is that you provide so little
information. Most of us are not clairvoyant.

> Over and over again and always after heavy disk I/O I see the following
> errors in the log files. If I force ar0s1g to unmount the machine
> spontaneously reboots. Nothing seriously seems to be damaged by this act,
> but anyway I cannot afford something bad happening to this production
> machine.

Why would you force an unmount?

> Apr 18 20:02:19 sun kernel: g_vfs_done():ar0s1g[WRITE(offset=290725068800, length=4096)]error = 5
>
> I have no clue what the errors mean, since offsets of 290725068800,
> 290725072896, and 290725074944 seem to be ridiculous. Does anybody have
> a clue what is going on?

For starters, how big is ar0s1g? If the offset is in bytes, it is around
270 GB, which is not that unusual in this day and age.

> I'm using FreeBSD 7.0, but found the error being reported before with
> previous versions of FreeBSD. I can and will provide more details on
> demand.

What does 'df' say? Did you notice any file corruption in the filesystem
on ar0s1g? Unmount the filesystem and run fsck(8) on it. Does it report
any errors?

> Any hints are very much appreciated.

Did you manage to create a partition larger than the disk (using newfs's
-s switch)? In that case you could be trying to write past the end of the
device.

Roland
--
R.F. Smith                http://www.xs4all.nl/~rsmith/
[plain text _non-HTML_ PGP/GnuPG encrypted/signed email much appreciated]
pgp: 1A2B 477F 9970 BA3C 2914 B7CE 1277 EFB0 C321 A725 (KeyID: C321A725)
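[Editor's note: Roland's size check above can be done as a quick calculation. This sketch assumes, as he does, that g_vfs_done() reports the offset in bytes from the start of ar0s1g; the `mediasize` value is a hypothetical placeholder to be replaced with the `diskinfo -v ar0s1g` output.]

```python
# Sanity-check the offsets from the g_vfs_done() errors.
GB = 10**9   # decimal gigabyte, as disk vendors use
GiB = 2**30  # binary gibibyte

# Offsets taken from the kernel log in this thread
offsets = [290725068800, 290725072896, 290725074944]

for off in offsets:
    # 290725074944 bytes is ~290.7 GB, i.e. ~270.7 GiB -- matching
    # Roland's "around 270 GB" estimate
    print(f"offset {off}: {off / GB:.1f} GB ({off / GiB:.1f} GiB)")

# Compare against the partition's media size in bytes, e.g. from
# `diskinfo -v ar0s1g`. Placeholder value -- not a measured number.
mediasize = 300_000_000_000
for off in offsets:
    if off >= mediasize:
        print(f"write at {off} would be past the end of the device")
```

A write whose offset lands beyond the end of the provider fails with error 5 (EIO), which is consistent with the symptom if the filesystem was created larger than the underlying device.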
RE: g_vfs_done error third part--PLEASE HELP!
Hi Willy,

You seem to have emailed me directly as well as posting to the list.

The bad offsets are probably due to filesystem corruption, and the actual
event that caused it was probably not reported (or is at least not
reported by these errors).

Basic question: do you have a hardware problem?
- Do you have ECC memory? If not, have you run memtest?
- Are your disks reliable, or is one corrupting data?

Less basic questions: what is the corruption, and what is its cause?
Answering that might require a little more work and dropping into the
debugger.

You could also try reconfiguring to use gmirror instead of ar to see if
that improves things (i.e. it could be an ar bug).

Regards,
Jan.

-----Original Message-----
From: Willy Offermans [mailto:[EMAIL PROTECTED]
Sent: Tuesday, 22 April 2008 5:04 AM
To: freebsd-stable@FreeBSD.ORG
Subject: g_vfs_done error third part--PLEASE HELP!

Dear FreeBSD friends,

It is already the third time that I report this error. Can someone help me
in solving this issue? Over and over again, always after heavy disk I/O, I
see the following errors in the log files. If I force ar0s1g to unmount,
the machine spontaneously reboots. Nothing seems to be seriously damaged
by this, but I cannot afford anything bad happening to this production
machine.

Currently the error is the following:

snip
...
Apr 21 19:44:36 sun kernel: g_vfs_done():ar0s1g[WRITE(offset=290725074944, length=2048)]error = 5
Apr 21 19:45:07 sun kernel: g_vfs_done():ar0s1g[WRITE(offset=290725074944, length=2048)]error = 5
Apr 21 19:45:38 sun kernel: g_vfs_done():ar0s1g[WRITE(offset=290725074944, length=2048)]error = 5
...
/snip

Before that, the error appeared like:

snip
...
Apr 18 20:00:15 sun kernel: g_vfs_done():ar0s1g[WRITE(offset=290725072896, length=2048)]error = 5
Apr 18 20:00:46 sun kernel: g_vfs_done():ar0s1g[WRITE(offset=290725068800, length=4096)]error = 5
Apr 18 20:00:46 sun kernel: g_vfs_done():ar0s1g[WRITE(offset=290725072896, length=2048)]error = 5
Apr 18 20:01:17 sun kernel: g_vfs_done():ar0s1g[WRITE(offset=290725068800, length=4096)]error = 5
Apr 18 20:01:17 sun kernel: g_vfs_done():ar0s1g[WRITE(offset=290725072896, length=2048)]error = 5
Apr 18 20:01:48 sun kernel: g_vfs_done():ar0s1g[WRITE(offset=290725068800, length=4096)]error = 5
Apr 18 20:01:48 sun kernel: g_vfs_done():ar0s1g[WRITE(offset=290725072896, length=2048)]error = 5
Apr 18 20:02:19 sun kernel: g_vfs_done():ar0s1g[WRITE(offset=290725068800, length=4096)]error = 5
...
/snip

I have no clue what the errors mean, since offsets of 290725068800,
290725072896, and 290725074944 seem ridiculous. Does anybody have a clue
what is going on? I'm using FreeBSD 7.0, but the error has been reported
before with previous versions of FreeBSD. I can and will provide more
details on demand.

Any hints are very much appreciated.
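[Editor's note: both Toomas and Jan suggest replacing the BIOS ataraid array with gmirror(8). A rough sketch of the commands involved follows, using the device names from this thread (ad4, ad6). This is illustrative only, not a migration procedure: an actual switch needs a full backup and the FreeBSD Handbook steps, since gmirror stores its metadata in the last sector of each disk.]

```shell
# Sketch only -- do NOT run as-is on a live array. Assumes the BIOS
# RAID has been dissolved and ad4 currently holds the data.

# Load the mirror class now and at every boot
kldload geom_mirror
echo 'geom_mirror_load="YES"' >> /boot/loader.conf

# Create a one-disk mirror gm0 from ad4 (writes metadata to the last
# sector, so the filesystem must not extend into it)
gmirror label -v -b round-robin gm0 ad4

# Update /etc/fstab to mount /dev/mirror/gm0s1* instead of
# /dev/ar0s1*, then reboot onto the mirror.

# Add the second disk; gmirror synchronises it in the background
gmirror insert gm0 ad6
gmirror status
```

Because gmirror is a software GEOM class, it removes the ataraid driver from the picture entirely, which is what makes it a useful test for Jan's "it could be an ar bug" hypothesis.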