Hi there,

I have run into a serious problem on my Linux (Ubuntu 13.10) box, which runs 
ZFS (zfsonlinux) as its root filesystem, and I'm afraid all my data is lost 
(though I would do just about anything to get it back...)

I have a ZFS pool made of a two-disk mirror (sda3, sdb3) plus an L2ARC cache 
device on an SSD (sdd4).

In the past, scrubbing the pool occasionally reported some errors, mostly on 
sda, which ZFS "fixed".

I was puzzled by that, because the disks' SMART data says "no errors 
whatsoever", the SMART self-tests pass, syslog doesn't record any interface 
errors, the system memory (non-ECC) passes Memtest86+, and the system never 
malfunctions... so there's no visible issue except for ZFS recording (and 
fixing?) errors while scrubbing.

I decided to "live with this for a while, because there's no money for 
replacing a working HD"... (Yes, before people tell me to go get other disks, 
another mobo, another PSU... I'm really out of cash. It's not an option.)

At some point my box's PSU died, and as I needed my system, I just dropped 
the 2 disks into another box, and it kept on working (Linux magic).

I took advantage of that to run a couple more scrubs on the new box, and they 
gave about the same results (so the issue lies either with the disks or with 
the ZFS software?)

I eventually got another PSU to repair my original system, and put the disks 
back in.

I messed around a bit with the SATA cables and drive order, and as Linux 
doesn't seem to be able to use drive IDs, only device names, for a root pool 
(too bad...), my system came up with a degraded mirror on a single disk 
(sda3, missing sdb3). But OK.

I turned the system off, fiddled with the cables, restarted, reinserted sdb3, 
and then it began to resilver.

After a day it eventually finished, but the resilver had reported about 120 
"Checksum errors" on sda, and about 10 on sdb. It said that the system had 
found an uncorrectable error, identifying it as something like 
<metadata>: <00x>

The system was still working, but I didn't know how to clear this seemingly 
minor error.

Turned the system off.

The next day the system booted OK, but immediately started to "resilver" 
again, showing about the same number of errors as usual.

But at some point the system hung completely, leaving me no choice but to 
pull the power cord.


Since then, my system won't boot at all. Trying to mount the root pool ends in 
some rude words from the kernel, which you can see here:

https://www.dropbox.com/s/ggrl2148t9brehh/P1030505.JPG
https://www.dropbox.com/s/sm0hfmpjy63emj4/P1030506.JPG

I tried booting an Ubuntu live USB, then installing ZFS and importing the 
pool, with the same result.

I got the same system crash when trying to import the pool in FreeBSD:
https://www.dropbox.com/s/f2jtg864jzut6o5/P1030508.JPG

And it even crashed OpenIndiana (so fast that I could only take a blurry pic):
https://www.dropbox.com/s/03w81p1xshekb79/P1030511.JPG

I'm currently getting some help and the issue is being worked on here:
https://github.com/zfsonlinux/spl/issues/329

...where you can find links to other pics with debugging output, the latest 
one so far ending in an error:
https://www.dropbox.com/s/jdxfn6zq9ffv02q/P1030517.JPG

I'd love to be able to rescue the data from this pool, as there's about 900GB 
on it, part of it being "not easily replaceable"... (Read: I don't back up 
things I can live without, but would love to get them back ;-)

Any help will be warmly welcomed and highly appreciated. It's a really bad 
situation :-(

TIA

-- 
Swâmi Petaramesh <[email protected]> http://petaramesh.org PGP 9076E32E
_______________________________________________
developer mailing list
[email protected]
http://lists.open-zfs.org/mailman/listinfo/developer