[zfs-discuss] Permanent errors
Hello! Can anybody help me with some trouble?

j...@opensolaris:~# zpool status -v
  pool: green
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: scrub in progress for 0h8m, 5.08% done, 2h42m to go
config:

        NAME        STATE     READ WRITE CKSUM
        green       ONLINE       0     0   117
          c3d0      ONLINE       0     0   234
          c4d0      ONLINE       0     0   234

errors: Permanent errors have been detected in the following files:

        /green/home/epiq/vid/2resolume/400x300/people/deti/primorsk - manoly - rebenok na vode.avi
        /green/home/epiq/mus/!!![Labels]!!!/playhouse/[playcd007] captain comatose - going out/11_captain comatose - $100 (live).mp3
        green/home:<0x1451a>
        green/home:<0x1cd29>
        green/home:<0x14537>
        green/home:<0x1454e>
        green/home:<0x14577>
        green/home:<0x1458e>
        green/home:<0x14599>
        /green/home/epiq/vid/!!!incoming/Mar`ja_Iskusnica.avi
        /green/home/epiq/mus/easy
        green/home:<0x144ec>
        green/home:<0x144f9>
        green/home:<0x144fc>

As you can see in the output, my ZFS pool has permanent errors on some files and directories. How can I clear these errors? I ran a scrub once and most of the errors went away, but some remain, and I can't delete the affected files:

j...@opensolaris:~# rm -rf /green/home/epiq/mus/easy
rm: Unable to remove directory /green/home/epiq/mus/easy: Directory not empty
j...@opensolaris:~# rmdir --ignore-fail-on-non-empty /green/home/epiq/mus/easy
rmdir: illegal option -- ignore-fail-on-non-empty
Usage: rmdir [-ps] dirname ...
j...@opensolaris:~# rm -rf /green/home/epiq/vid/2resolume/400x300/people/deti/primorsk - manoly - rebenok na vode.avi
j...@opensolaris:~# ls -la /green/home/epiq/vid/2resolume/400x300/people/deti/primorsk - manoly - rebenok na vode.avi
-rw-r--r--   1 101   staff   18221286 Oct 25  2008 /green/home/epiq/vid/2resolume/400x300/people/deti/primorsk - manoly - rebenok na vode.avi

While googling around this problem I found the zdb command:

j...@opensolaris:~# zdb -d green/home 0x1451a 0x1cd29 0x14537 0x1454e 0x14577 0x1458e 0x14599 0x144ec 0x144f9 0x144fc
Dataset green/home [ZPL], ID 40, cr_txg 325, 906G, 116136 objects

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
     83226    2    16K   128K  12.6M  12.6M  100.00  ZFS plain file
    118057    4    16K   128K  5.22G  5.22G  100.00  ZFS plain file
     83255    2    16K   128K  7.63M  7.62M  100.00  ZFS plain file
     83278    2    16K   128K  4.25M  4.25M  100.00  ZFS plain file
     83319    2    16K   128K  5.38M  5.38M  100.00  ZFS plain file
     83342    2    16K   128K  5.50M  5.50M  100.00  ZFS plain file
     83353    2    16K   128K  6.13M  6.12M  100.00  ZFS plain file
     83180    2    16K   128K  8.26M  8.25M  100.00  ZFS plain file
     83193    2    16K   128K  6.01M  6.00M  100.00  ZFS plain file
     83196    2    16K   128K  8.63M  8.62M  100.00  ZFS plain file

but how can it help me?

With best wishes, Epiq.
Re: [zfs-discuss] Permanent errors
Hi--

The best approach is to correct the issues that are causing these problems in the first place. The fmdump -eV command will identify the hardware problems that caused the checksum errors and the corrupted files.

You might be able to use some combination of zpool scrub, zpool clear, and removing these corrupted files manually, but unless the failing disks (?) are repaired, you could potentially lose more data.

We have a zpool recovery feature in build 128, but it will not work in this scenario since the pool is online (it's importable).

If you can't repair the underlying hardware issues and you have no backup of this data, then you might consider reviewing the steps that are described here:

http://www.solarisinternals.com/wiki/index.php/ZFS_forensics_scrollback_script

We haven't had a chance to look at this script closely or test it yet, so the usual caveats apply.

Thanks,

Cindy
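A minimal sketch of the scrub/clear/remove combination mentioned above, assuming the hardware is dealt with first; the pool name comes from the thread, the file path is only a placeholder, and the exact ordering may need adjusting:

  # fmdump -eV | more                        (review the error telemetry for the disks)
  # rm /green/home/epiq/some/corrupt/file    (remove or restore each file listed by zpool status -v)
  # zpool scrub green                        (let ZFS re-verify every block)
  # zpool status -v green                    (confirm the error list is shrinking)
  # zpool clear green                        (reset the error counters once the list is empty)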
Re: [zfs-discuss] Permanent errors
Cindy, thank you for the answer, but I need to explain some details. This pool is new hardware for my system - 2x1TB WD Green hard drives - but the data on it was copied from an old pool of 9x300GB hard drives with hardware problems. While I was copying the data there were many errors, but at the end I see this picture:

j...@opensolaris:~# fmdump -eV
fmdump: failed to open /var/fm/fmd/errlog: No such file or directory

and there are no READ or WRITE errors on the new pool - only CKSUM. As I understand it, this points to transfer errors rather than a problem with the new hardware. Now I only need to clear these permanent errors on the new pool, so that I can hopefully restore the files from backups.
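If fmdump has no error log to read, a couple of other generic Solaris commands can help judge whether the new disks themselves are misbehaving; these are general suggestions, not output from this system:

  # iostat -En          (per-device soft/hard/transport error counters since boot)
  # fmadm faulty        (anything FMA has already diagnosed as faulty)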
Re: [zfs-discuss] Permanent errors
Hi,

I think you are saying that you copied the data on this system from a previous system with hardware problems. It looks like the data that was copied was corrupt, which is causing the permanent errors on the new system(?)

The manual removal of the corrupt files, zpool scrub, and zpool clear might work, but I don't have experience with this many errors on a non-redundant config.

If you have a clean backup of this data, you might consider destroying the green/home dataset, clearing the pool errors, recreating green/home, and restoring the known good data.

Maybe someone else can suggest a better approach...

Cindy
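Roughly, the destroy-and-restore path described above would look something like the following; the dataset and pool names come from the thread, the restore step depends on whatever backup exists, and this is a sketch rather than a tested procedure:

  # zfs destroy -r green/home        (destroys the dataset and any snapshots - only with a verified backup)
  # zpool scrub green
  # zpool clear green                (clear the pool-wide error counters)
  # zfs create green/home
  (restore the backup into /green/home, e.g. with zfs receive, rsync, or tar)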
Re: [zfs-discuss] Permanent errors on two files
On Fri, Dec 04, 2009 at 02:52:47PM -0700, Cindy Swearingen wrote:
> If space/dcc is a dataset, is it mounted? ZFS might not be able to
> print the filenames if the dataset is not mounted, but I'm not sure if
> this is why only object numbers are displayed.

Yes, it's mounted and is quite an active filesystem.

> I would also check fmdump -eV to see how frequently the hardware has
> had problems.

That shows ZFS checksum errors in July, but nothing since that time. There were also DIMM errors before that, starting in June. We replaced the failed DIMMs, also in July. This is an X4450 with ECC memory. There were no disk errors reported. I suppose we can blame the memory.

-- 
-Gary Mills-    -Unix Group-    -Computer and Network Services-
Re: [zfs-discuss] Permanent errors on two files
On Sat, Dec 05, 2009 at 01:52:12AM +0300, Victor Latushkin wrote:
> On Dec 5, 2009, at 0:52, Cindy Swearingen <cindy.swearin...@sun.com> wrote:
> > The zpool status -v command will generally print out filenames, dnode
> > object numbers, or identify metadata corruption problems. These look
> > like object numbers, because they are large, rather than metadata
> > objects, but an expert will have to comment.
>
> Yes, these are object numbers, and the most likely reason they are not
> turned into filenames is that the corresponding files no longer exist.

That seems to be the case:

# zdb -d space/dcc 0x11e887 0xba25aa
Dataset space/dcc [ZPL], ID 21, cr_txg 19, 20.5G, 3672408 objects

> So I'd run scrub another time; if the files are gone and there are no
> other corruptions, the scrub will reset the error log and zpool status
> should become clean.

That worked. After the scrub, there are no errors reported.

> > You might be able to identify these object numbers with zdb, but I'm
> > not sure how to do that.
>
> You can try to use zdb this way to check if these objects still exist:
>
>   zdb -d space/dcc 0x11e887 0xba25aa

-- 
-Gary Mills-    -Unix Group-    -Computer and Network Services-
Re: [zfs-discuss] Permanent errors on two files
Hi Gary,

To answer your questions, the hardware read some data and ZFS detected a problem with the checksums in this dataset and reported this problem. ZFS can do this regardless of ZFS redundancy.

I don't think a scrub will fix these permanent errors, but it depends on the corruption. If it's data, but not redundant and no copies=2, then probably not. If it's metadata, then multiple copies exist, but it depends on the extent of the corruption.

If space/dcc is a dataset, is it mounted? ZFS might not be able to print the filenames if the dataset is not mounted, but I'm not sure if this is why only object numbers are displayed.

The zpool status -v command will generally print out filenames, dnode object numbers, or identify metadata corruption problems. These look like object numbers, because they are large, rather than metadata objects, but an expert will have to comment.

You might be able to identify these object numbers with zdb, but I'm not sure how to do that.

I would also check fmdump -eV to see how frequently the hardware has had problems.

Cindy

On 12/04/09 12:19, Gary Mills wrote:
> I just noticed this today:
>
> # zpool status -v
>   pool: space
>  state: ONLINE
> status: One or more devices has experienced an error resulting in data
>         corruption.  Applications may be affected.
> action: Restore the file in question if possible.  Otherwise restore the
>         entire pool from backup.
>    see: http://www.sun.com/msg/ZFS-8000-8A
>  scrub: none requested
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         space       ONLINE       0     0     0
>           c0t1d0    ONLINE       0     0     0
>
> errors: Permanent errors have been detected in the following files:
>
>         space/dcc:<0x11e887>
>         space/dcc:<0xba25aa>
>
> The device here is a hardware mirror of two 146-gig SAS drives. How can
> ZFS detect errors when it has no redundancy? How do I determine what
> files these are? Will a scrub fix it? This is a production system, so I
> want to be careful. It's running Solaris 10 5/09 s10x_u7wos_08 X86.
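For what it's worth, one possible way to map a ZFS object number back to a path, sketched from the zdb and find documentation rather than tested against this pool; the dataset name and object numbers are the ones from this thread, and the default mountpoint /space/dcc is assumed:

  # zdb -dddd space/dcc 0x11e887          (the verbose dump of a plain-file object includes its path, if the object still exists)
  # find /space/dcc -xdev -inum 1173639   (for ZPL file objects the object number is the inode number; 0x11e887 is 1173639 decimal)

If zdb reports nothing for the object, the file has most likely already been removed, which fits the resolution reported elsewhere in this thread.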
Re: [zfs-discuss] Permanent errors on two files
On Dec 5, 2009, at 0:52, Cindy Swearingen <cindy.swearin...@sun.com> wrote:
> The zpool status -v command will generally print out filenames, dnode
> object numbers, or identify metadata corruption problems. These look
> like object numbers, because they are large, rather than metadata
> objects, but an expert will have to comment.

Yes, these are object numbers, and the most likely reason they are not turned into filenames is that the corresponding files no longer exist.

So I'd run scrub another time; if the files are gone and there are no other corruptions, the scrub will reset the error log and zpool status should become clean.

> You might be able to identify these object numbers with zdb, but I'm
> not sure how to do that.

You can try to use zdb this way to check if these objects still exist:

zdb -d space/dcc 0x11e887 0xba25aa

Victor
[zfs-discuss] Permanent errors on filesystem (opensolaris 2008.05)
Hi,

I am looking for guidance on the following ZFS setup and error:

- opensolaris 2008.05 running as a guest in VMware Server on an Ubuntu host
- the system has run flawlessly as an NFS file server for some months now; a single zpool (called 'tank'), 2 vdevs each as raid-z, about 10 filesystems (one of them called 'mail')
- after a power surge caused a reboot, opensolaris became unable to mount the pool

Using the opensolaris CD as a rescue disk, I discovered a permanent error (ZFS-8000-8A) quoting tank/mail:0x0 as the location of the error (that is the name of the filesystem itself and not a specific file). The filesystem contains maildir archives, probably on the order of 10,000 files. The pool comes out clean from a scrub.

Googling, I tried to unmount/mount to possibly replay the log (ZIL), in case the transactions didn't play through entirely. Same negative result.

Reading http://docs.sun.com/app/docs/doc/819-5461/gbbwl?a=view, there is a case mentioning monkey/dnode:0x0 that seems close enough. Is that really the case? If so, how do I 'move' the data as the solution proposes?

As you can imagine, I'd like to rescue the files, so any alternative hint is welcome. Thanks.
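Before doing anything destructive, it may be worth copying off whatever is still readable. A rough sketch, assuming the pool imports from the rescue environment even though tank/mail is complaining, and with the destination path purely a placeholder:

  # zpool import -f tank
  # zfs mount tank/mail                                  (may fail or only partly succeed; copy whatever it exposes)
  # rsync -a /tank/mail/ /mnt/otherdisk/mail-rescue/     (or tar/cpio to another disk if rsync is not on the CD)

Once a copy exists, destroying and recreating tank/mail and restoring into it is one way to get rid of a dataset-level permanent error, but only after everything recoverable has been saved.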
Re: [zfs-discuss] Permanent errors on filesystem (opensolaris 2008.05)
Reading through the post, the error message didn't come through properly. It is tank/mail:<0x0> (with less-than and greater-than signs on either side of the 0x0). Also, the 4 disks (2 vdevs x 2 for raid-z) are physical SATA disks dedicated to the VMware image. Thanks.