On 2015-10-08 04:28, Pavel Pisa wrote:
> Hello everybody,
>
> On Monday 05 of October 2015 22:26:46 Pavel Pisa wrote:
>> Hello everybody,
>> ...
>> BTRFS has recognized the appearance of its partition (even though it
>> changed from sdb5 to sde5 when the disk was "hotplugged" again). But it
>> seems that the RAID1 components are not in sync, and BTRFS continues
>> to report:
>>   BTRFS: lost page write due to I/O error on /dev/sde5
>>   BTRFS: bdev /dev/sde5 errs: wr 11021805, rd 8526080, flush 29099, corrupt 0, gen
>>
>> I have tried to find the best way to resync RAID1 BTRFS partitions.
>> The problem is that the filesystem is the root one of the system, so a
>> reboot to some rescue media is required to run btrfsck --repair, which
>> is intended for unmounted devices.
>>
>> What is the behavior of BTRFS in this situation? Is BTRFS able to use
>> data from the out-of-date partition in cases where the data in the
>> respective files have not been modified? The main reason for the
>> question is whether such (stable) data can be backed up from the
>> out-of-sync partition in case some random block wears out on the other
>> device. Or is this situation equivalent to running with only one disk?
>> Are there some parameters/solutions to run some command (scrub,
>> balance) which makes the devices be in sync again without an unmount
>> or reboot?
>>
>> I believe that attaching one more drive and running "btrfs replace"
>> would solve the described situation. But is there some equivalent to
>> run the operation "in place"?
>
> It seems that the SATA controller is not able to activate a link which
> has not been connected at BIOS POST time. This means that I cannot add
> a new drive without a reboot.

Check your BIOS options; there should be an option to set SATA ports as either 'Hot-Plug' or 'External', which should allow you to hot-plug drives without needing a reboot (unless it's a Dell system; they have never properly implemented the SATA standard on their desktops).
> Before reboot, the server bleeds with messages:
>   BTRFS: bdev /dev/sde5 errs: wr 11715459, rd 8526080, flush 29099, corrupt 0, gen 0
>   BTRFS: lost page write due to I/O error on /dev/sde5
>   BTRFS: bdev /dev/sde5 errs: wr 11715460, rd 8526080, flush 29099, corrupt 0, gen 0
>   BTRFS: lost page write due to I/O error on /dev/sde5

Even aside from the issues discussed below, if your disk is showing that many errors, you should probably run a SMART self-test on it to determine whether this is just a transient issue or an indication of an impending disk failure. The commands I'd suggest are:

    smartctl -t short /dev/sde

That will print an estimate of how long the test takes; after waiting that long, run:

    smartctl -H /dev/sde

If that says the health check failed, replace the disk as soon as possible, and don't use it for storing any data you can't afford to lose.
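If you want to script that check, the verdict can be grepped out of the `smartctl -H` output. A minimal sketch, assuming a typical smartctl verdict line; the device name /dev/sde and the canned sample line are only illustrative:

```shell
#!/bin/sh
# Hedged sketch: parse the overall verdict out of `smartctl -H` output.
check_health() {
    # Reads `smartctl -H` output on stdin and prints a one-line verdict.
    if grep -q 'test result: PASSED'; then
        echo "disk healthy"
    else
        echo "replace the disk"
    fi
}

# Real usage would be (commented out so the sketch runs anywhere):
#   smartctl -t short /dev/sde        # note the estimated duration it prints
#   sleep 120                         # wait at least that long
#   smartctl -H /dev/sde | check_health

# Demonstration against a canned verdict line:
echo "SMART overall-health self-assessment test result: PASSED" | check_health
```

Note that a PASSED verdict only means the drive has not yet tripped its own failure thresholds; with an error count that high, keeping an eye on the full `smartctl -a` attributes is still worthwhile.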
> That changed to the following messages after reboot:
>   Btrfs loaded
>   BTRFS: device label riki-pool devid 1 transid 282383 /dev/sda3
>   BTRFS: device label riki-pool devid 2 transid 249562 /dev/sdb5
>   BTRFS info (device sda3): disk space caching is enabled
>   BTRFS (device sda3): parent transid verify failed on 44623216640 wanted 263476 found 212766
>   BTRFS (device sda3): parent transid verify failed on 45201899520 wanted 282383 found 246891
>   BTRFS (device sda3): parent transid verify failed on 45202571264 wanted 282383 found 246890
>   BTRFS (device sda3): parent transid verify failed on 45201965056 wanted 282383 found 246889
>   BTRFS (device sda3): parent transid verify failed on 45202505728 wanted 282383 found 246890
>   BTRFS (device sda3): parent transid verify failed on 45202866176 wanted 282383 found 246890
>   BTRFS (device sda3): parent transid verify failed on 45207126016 wanted 282383 found 246894
>   BTRFS (device sda3): parent transid verify failed on 45202522112 wanted 282383 found 246890
>   BTRFS: bdev /dev/disk/by-uuid/1627e557-d063-40b6-9450-3694dd1fd1ba errs: wr 11723314, rd 8526080, flush 2
>   BTRFS (device sda3): parent transid verify failed on 45206945792 wanted 282383 found 67960
>   BTRFS (device sda3): parent transid verify failed on 45204471808 wanted 282382 found 67960
>
> which looks really frightening to me. The temporarily disconnected
> drive has an old transid at start (OK). But what do the rest of the
> lines mean? If it means that files with older transaction IDs are used
> from the temporarily disconnected drive (now /dev/sdb5), and newer
> versions from /dev/sda3 are ignored and reported as invalid, then this
> means severe data loss, and it may be a mismatch because all
> transactions after the disk disconnect are lost (i.e. the FS root has
> been taken from the misbehaving drive at its old version). BTRFS does
> not even fall to read-only/degraded mode after the system restart.
This actually surprises me.
> On the other hand, from the logs (all stored on the possibly damaged
> root FS) it seems that there are no missing messages from the days when
> the disks were out of sync, so it looks like all the data are OK. So
> should I expect that BTRFS managed the problems well and all data are
> consistent?

I would be very careful in that situation; you may still have issues. At the very least, make a backup of the system as soon as possible.
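One way to take that backup while the root filesystem stays mounted is a read-only snapshot streamed with btrfs send. A minimal, hedged sketch; the snapshot path and the /mnt/backup destination (another btrfs filesystem) are hypothetical, and the script only echoes the commands unless DRYRUN=0:

```shell
#!/bin/sh
# Hedged sketch of an online backup of a btrfs root filesystem.
# /mnt/backup and the snapshot path are hypothetical placeholders.
run() {
    # Print each command; only execute it when DRYRUN=0 is set.
    echo "+ $*"
    if [ "${DRYRUN:-1}" = "0" ]; then "$@"; fi
}

SNAP="/root-backup-snapshot"    # illustrative snapshot path

# Take a read-only snapshot of the (still mounted) root subvolume,
# then stream it to the backup filesystem.
run btrfs subvolume snapshot -r / "$SNAP"
run sh -c "btrfs send $SNAP | btrfs receive /mnt/backup"
```

btrfs send requires the source snapshot to be read-only, which is why the snapshot is taken with -r; the dry-run wrapper makes it safe to review the plan before committing to it.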
> I am going to use "btrfs replace" because there has not been any reply
> to my in-place correction question. But I expect that a clarification
> of whether/how it is possible to resync RAID1 after one drive
> temporarily disappears is really important to many BTRFS users.

As of right now, there is no way that I know of to safely re-sync a drive that has been disconnected for a while. The best bet is probably to use replace, but for that to work reliably, you would need to tell it to ignore the now-stale drive when trying to read each chunk.
It is theoretically possible to wipe the FS signature on the out-of-sync drive, run a device scan, and then run 'replace missing' pointing at the now 'blank' device, although going that route is really risky.
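For the record, that risky route might look something like the following. This is a hypothetical sketch, not a tested recovery procedure: the device name and devid match the logs in this thread, and the script only echoes the commands unless DRYRUN=0. Do not attempt it without a verified backup.

```shell
#!/bin/sh
# DANGEROUS, hypothetical sketch of the signature-wipe + replace route
# described above -- not a tested recovery procedure.
run() {
    # Print each command; only execute it when DRYRUN=0 is set.
    echo "+ $*"
    if [ "${DRYRUN:-1}" = "0" ]; then "$@"; fi
}

STALE=/dev/sdb5    # the out-of-sync device, per the logs in this thread
MNT=/              # the mounted btrfs root

run wipefs --all --backup "$STALE"         # clear the stale btrfs superblock magic
run btrfs device scan                      # let the kernel drop the stale copy
run btrfs replace start 2 "$STALE" "$MNT"  # rebuild devid 2 onto the now-"blank" device
run btrfs replace status "$MNT"            # monitor the rebuild
```

wipefs --backup keeps a copy of the erased signature in the caller's home directory, which gives at least a small escape hatch if the replace cannot start afterwards.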