Hello all, I have been running Rockstor 3.8.16-8 on an older Dell Optiplex for about a month. The system has four drives separated into two RAID1 filesystems (“pools” in Rockstor terminology). A few days ago I restarted it and noticed that the services (NFS, Samba, etc.) weren’t working. Looking at dmesg, I saw:
kernel: BTRFS error (device sdb): parent transid verify failed on 1721409388544 wanted 19188 found 83121
and sure enough, one of the subvolumes on my main filesystem is corrupted. By
corrupted I mean it can’t be accessed, deleted, or even looked at:
ls -l
kernel: BTRFS error (device sdb): parent transid verify failed on 1721409388544 wanted 19188 found 83121
kernel: BTRFS error (device sdb): parent transid verify failed on 1721409388544 wanted 19188 found 83121
ls: cannot access /mnt2/Primary/Movies: Input/output error
total 16
drwxr-xr-x 1 root root 100 Dec 29 02:00 .
drwxr-xr-x 1 root root 208 Jan 3 12:05 ..
drwxr-x--- 1 kbogert root 698 Feb 6 08:49 Documents
drwxr-xrwx 1 root root 916 Jan 3 12:54 Games
drwxr-xrwx 1 xenserver xenserver 2904 Jan 3 12:54 ISO
d????????? ? ? ? ? ? Movies
drwxr-xrwx 1 root root 139430 Jan 3 12:53 Music
drwxr-xrwx 1 root root 82470 Jan 3 12:53 RawPhotos
drwxr-xr-x 1 root root 80 Jan 1 04:00 .snapshots
drwxr-xrwx 1 root root 72 Jan 3 13:07 VMs
The input/output error is given for any operation on Movies.
Luckily there has been no data loss that I am aware of. As it turns out I have
a snapshot of the Movies subvolume taken a few days before the incident. I was
able to simply cp -a all files off of the entire filesystem, with no reported
errors, and verified a handful of them. Note that the transid error in dmesg
alternates between sdb and sda5 after each startup.
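For what it’s worth, the wanted/found generations can be pulled out of log lines like the above with a bit of awk; this is just a hypothetical helper I use to summarize the mismatch (the pattern assumes the one-line “parent transid verify failed” format shown above):

```shell
#!/bin/sh
# Summarize btrfs "parent transid verify failed" errors from kernel log lines.
# Usage: dmesg | transid_summary   (fed a sample line below for illustration)
transid_summary() {
    awk '/parent transid verify failed/ {
        dev = ""; wanted = 0; found = 0
        for (i = 1; i <= NF; i++) {
            if ($i == "(device") { dev = $(i + 1); sub(/\):$/, "", dev) }
            if ($i == "wanted")  { wanted = $(i + 1) + 0 }
            if ($i == "found")   { found  = $(i + 1) + 0 }
        }
        printf "%s wanted=%d found=%d gap=%d\n", dev, wanted, found, found - wanted
    }'
}

# Example with the line from this report:
echo "kernel: BTRFS error (device sdb): parent transid verify failed on 1721409388544 wanted 19188 found 83121" | transid_summary
# prints: sdb wanted=19188 found=83121 gap=63933
```

The large gap between the wanted and found generations is what makes any access to the affected metadata fail.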
SETUP DETAILS
uname -a
Linux ironmountain 4.8.7-1.el7.elrepo.x86_64 #1 SMP Thu Nov 10 20:47:24 EST
2016 x86_64 x86_64 x86_64 GNU/Linux
btrfs --version
btrfs-progs v4.8.3
btrfs dev scan
kernel: BTRFS: device label Primary devid 1 transid 83461 /dev/sdb
kernel: BTRFS: device label Primary devid 2 transid 83461 /dev/sda5
btrfs fi show /mnt2/Primary
Label: 'Primary' uuid: 21e09dd8-a54d-49ec-95cb-93fdd94f0c17
Total devices 2 FS bytes used 943.67GiB
devid 1 size 2.73TiB used 947.06GiB path /dev/sdb
devid 2 size 2.70TiB used 947.06GiB path /dev/sda5
btrfs dev usage /mnt2/Primary
/dev/sda5, ID: 2
Device size: 2.70TiB
Device slack: 0.00B
Data,RAID1: 944.00GiB
Metadata,RAID1: 3.00GiB
System,RAID1: 64.00MiB
Unallocated: 1.77TiB
/dev/sdb, ID: 1
Device size: 2.73TiB
Device slack: 0.00B
Data,RAID1: 944.00GiB
Metadata,RAID1: 3.00GiB
System,RAID1: 64.00MiB
Unallocated: 1.80TiB
btrfs fi df /mnt2/Primary
Data, RAID1: total=944.00GiB, used=942.60GiB
System, RAID1: total=64.00MiB, used=176.00KiB
Metadata, RAID1: total=3.00GiB, used=1.07GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
This server sees very light use; however, I do have a number of VMs in the VMs
filesystem, exported over NFS, that are used by a Xenserver. These are not
marked nocow, though I probably should have done so. At the time of restart no
VMs were running.
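(For future reference, the NOCOW attribute is set with chattr +C, and it only takes effect on files created after the flag is set, so it is easiest to apply to an empty directory before copying VM images in. A minimal sketch, with a hypothetical path:

```shell
#!/bin/sh
# Create a directory for VM images with the NOCOW attribute (+C).
# +C must be set before any data is written; existing files keep CoW.
# Note: +C also disables btrfs checksumming for files created under it.
VMDIR=${VMDIR:-$(mktemp -d)/VMs}    # in practice e.g. /mnt2/Primary/VMs

mkdir -p "$VMDIR"
if chattr +C "$VMDIR" 2>/dev/null; then
    echo "NOCOW set on $VMDIR"
else
    echo "NOCOW not supported on this filesystem"
fi
lsattr -d "$VMDIR" 2>/dev/null || true   # 'C' appears in the flags on btrfs
```

Whether this would have avoided the corruption here, I don’t know.)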
I have deviated from Rockstor’s default setup a bit. They take an “appliance”
view and try to enforce btrfs partitions that cover entire disks. I installed
Rockstor onto /dev/sda4, created the Primary partition on /dev/sdb using
Rockstor’s gui, then on the command line added /dev/sda5 to it and converted to
raid1. As far as I can tell Rockstor is just CentOS 7 with a few updated
utilities and a bunch of python scripts for providing a web interface to
btrfs-progs. I have it set up to take monthly snapshots and do monthly scrubs,
with the exception of the Documents subvolume, which takes daily snapshots.
These are all readonly and go in the .snapshots directory. Rockstor
automatically deletes old snapshots once a limit is reached (7 daily snapshots,
for instance).
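Concretely, the add-and-convert step was along these lines (shown as a dry run that only prints the commands; drop the echo prefixes to apply them for real — device and mount point are the ones from this setup):

```shell
#!/bin/sh
# Sketch of adding a second device to a single-device btrfs filesystem and
# converting it to RAID1 (both data and metadata). Dry run: the commands
# are printed rather than executed.
DEV=/dev/sda5          # device to add
MNT=/mnt2/Primary      # mount point of the existing filesystem

echo btrfs device add "$DEV" "$MNT"
# Rebalance existing chunks so data and metadata are mirrored as RAID1:
echo btrfs balance start -dconvert=raid1 -mconvert=raid1 "$MNT"
```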
Side note, btrfs-progs 4.8.3 apparently has problems with CentOS 7’s glibc:
https://github.com/rockstor/rockstor-core/issues/1608 . I have confirmed that
bug in my own compiled version of 4.8.3, and that 4.9.1 does not have it.
WHAT I’VE TRIED AND RESULTS
First off, I have created an image with btrfs-image that I can make available
(though it is large; I believe it was a few GB, and the filesystem is 3 TB).
* btrfs-zero-log
had no discernible effect.
* At this point, I compiled btrfs-progs 4.9.1. The following commands were run
with this version:
* btrfs check
This exits in an assert fairly quickly:
checking extents
cmds-check.c:5406: check_owner_ref: BUG_ON `rec->is_root` triggered, value 1
/mnt/usb/btrfs-progs-bin/bin/btrfs[0x42139b]
/mnt/usb/btrfs-progs-bin/bin/btrfs[0x421483]
/mnt/usb/btrfs-progs-bin/bin/btrfs[0x430529]
/mnt/usb/btrfs-progs-bin/bin/btrfs[0x43160c]
/mnt/usb/btrfs-progs-bin/bin/btrfs[0x435d6f]
/mnt/usb/btrfs-progs-bin/bin/btrfs[0x43ab71]
/mnt/usb/btrfs-progs-bin/bin/btrfs[0x43b065]
/mnt/usb/btrfs-progs-bin/bin/btrfs(cmd_check+0xbbc)[0x441b82]
/mnt/usb/btrfs-progs-bin/bin/btrfs(main+0x12b)[0x40a734]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7ffff6fa7b35]
/mnt/usb/btrfs-progs-bin/bin/btrfs[0x40a179]
Full backtrace is attached as btrfsck_debug.log
* btrfs check --mode lowmem
This outputs a large number of errors before finally segfaulting.
Full backtrace attached as btrfsck_lowmem_debug.log
* btrfs scrub
This completes with no errors.
* Memtest86 completed more than 6 passes with no errors (I left it running for
a day).
* No SMART errors, btrfs device stats shows no errors. The drives the
filesystem is on are brand new.
* I have tried to recreate the problem by installing Rockstor into a number of
VMs and redoing my steps, no such luck.
The main Rockstor partition (btrfs), as well as the other RAID1 pool on
completely separate drives, was not affected. I can provide any other logs
requested.
Help would be greatly appreciated!
Kenneth Bogert
