Hi there,

I have run into a serious problem on my Linux (Ubuntu 13.10) box running ZFS as its root filesystem (zfsonlinux), and I'm afraid all my data is lost (but I would do just about anything to get it back...).

I have a ZFS pool made of a disk mirror (sda3, sdb3) plus an L2ARC cache on an SSD (sdd4).

In the past, when scrubbing the pool, I sometimes got errors, mostly on sda, which ZFS "fixed". I was puzzled by that because the disks' SMART data says "no errors whatsoever", the SMART self-tests pass, syslog doesn't record any interface errors, the system memory (non-ECC) passes Memtest86+, and the system never malfunctions... So there's no visible issue except for ZFS recording (and fixing?) errors while scrubbing.

I decided to "live with this for a while, because there's no money for replacing a working HD"... (Yes, people, before you tell me to go get other disks, another mobo, another PSU... I'm really out of cash. It's not an option.)

At some point my box's PSU died, and as I needed my system I just dropped the 2 disks into another box, and it kept on working (Linux magic). I took advantage of that to run another couple of scrubs in the new box, and they gave about the same results (so the issue lies either with the disks or with the ZFS software?).

I eventually got another PSU to repair my original system, and dropped the disks back in. I messed a bit with the SATA cables and drive order, and as Linux doesn't seem to be able to use drive IDs, only device names, for a root pool (too bad...), my system came up with a degraded mirror on a single disk (sda3, missing sdb3). But OK.

I turned the system off, fiddled with the cables, restarted, and reinserted sdb3, and then it began to resilver. After a day it eventually finished, but the resilver had recorded about 120 "Checksum errors" on sda, and about 10 on sdb. It said that the system had found an uncorrectable error, identifying it as something like <metadata>: <00x>

Still, it was working, but I didn't know how to clear this seemingly minor error. I turned the system off.

The next day the system booted OK, but immediately started to "resilver" again, still showing about the same number of errors as usual.
But at some point the system hung completely, leaving me no choice other than pulling the power cord.

Since then, my system won't boot at all. Trying to mount the root pool ends in the kernel errors you can see here:

https://www.dropbox.com/s/ggrl2148t9brehh/P1030505.JPG
https://www.dropbox.com/s/sm0hfmpjy63emj4/P1030506.JPG

I tried booting an Ubuntu live USB, then installing ZFS and importing the pool, with the same result.

I got the same kind of crash trying to import the pool in FreeBSD:

https://www.dropbox.com/s/f2jtg864jzut6o5/P1030508.JPG

It even crashed OpenIndiana (so fast that I could only take a blurry pic):

https://www.dropbox.com/s/03w81p1xshekb79/P1030511.JPG

I'm currently getting some help, and the issue is being worked on here:

https://github.com/zfsonlinux/spl/issues/329

There you can find links to other pics with debugging output, the last one so far ending in an error:

https://www.dropbox.com/s/jdxfn6zq9ffv02q/P1030517.JPG

I'd love to be able to rescue the data from this pool, as there's about 900GB on it, part of it being "not easily replaceable"... (Read: I don't back up things I can live without, but I would love to get them back ;-)

Any help will be warmly welcomed and highly appreciated. It's a really bad situation :-(

TIA

-- 
Swâmi Petaramesh <[email protected]> http://petaramesh.org PGP 9076E32E
