On 2015-11-09 05:56, Anand Jain wrote:
> This set of patches provides btrfs hot spare and auto replace support,
> for your review and comments.
>
> First, here are simple example steps to configure it:
>
> Add a spare device:
>     btrfs spare add /dev/sde -f
>
> OR, if a spare device was already added earlier, just run
>     btrfs dev scan [/dev/sde]
> which will register the spare device with the kernel.
>
>     btrfs fi show
>     Label: none  uuid: 52f170c1-725c-457d-8cfd-d57090460091
>         Total devices 2 FS bytes used 112.00KiB
>         devid    1 size 2.00GiB used 417.50MiB path /dev/sdc
>         devid    2 size 2.00GiB used 417.50MiB path /dev/sdd
>
>     Global spare
>         device size 3.00GiB path /dev/sde
>
> That's it.
>
> Auto replace:
> Replace happens automatically: when any write or flush fails, the
> device is marked as failed, which stops any further IO attempts to
> that device. In the next commit thread cycle, auto replace picks the
> spare device (/dev/sde in the example above) to replace the failed
> device, and so the btrfs volume is back to a healthy state.
>
> Global spare:
> As of now only a btrfs global hot spare is supported; that is, hot
> spare(s) are for all the btrfs filesystems in the system.
>
> No spare when device failed:
> It scans for a spare device at the rate of transaction commits, and
> triggers the auto replace whenever a spare device is added.
>
> Priority:
> In some future work there can be some chronological order in which to
> pick a spare and the failed device.
>
> Patches:
>
> Kernel:
> First, it needs Qu's per-chunk missing device patchset, which is part
> of the set here. There is also a light optimization (patch 5/15) which
> was required as part of this enhancement.
>
> Next, patches 7,8/15 bring in support for managing the transition of
> devices from online (no state) to the offline OR failed state
> dynamically, on top of static device states like the current "missing"
> state.
>
> Patch 9/15 fixes a bug: we should have blocked the incompatible
> feature at the device scan/add level instead of (or in addition to)
> the mount level. This is because we don't have to bring a device into
> the device list if it is incompatible.
>
> Next, patches 10,11,12,13/15 add support for spare devices. For
> details on how to add a spare device, kindly see further below. On a
> kernel without spare-feature support, the spare device is kept away.
> On a kernel that supports spare devices, the spare device is prevented
> from being mounted. Further, this patch set provides helper functions
> to pick a spare device and to release a spare device back to the spare
> device pool.
>
> Patch 14/15 provides functions for auto replace. These are mainly
> taken from the existing replace code, and in the long run I see an
> opportunity to merge this code with the replace code that is triggered
> from user space.
>
> Last, 15/15 uses all these facilities: it picks a failed device and
> triggers an auto replace in a kthread (casualty_kthread()).
>
> Progs:
> Would need 4 patches as listed below.
>
> Known bug:
> As of now I see the stale kmem cache below during module unload,
> which I am digging into.
>
> ------
> BUG btrfs_path (Not tainted): Objects remaining in btrfs_path on kmem_cache_close()
> ------
>
> Anand Jain (10):
>   btrfs: optimize btrfs_check_degradable() for calls outside of barrier
>   btrfs: introduce device dynamic state transition to offline or failed
>   btrfs: check device for critical errors and mark failed
>   btrfs: block incompatible optional features at scan
>   btrfs: introduce BTRFS_FEATURE_INCOMPAT_SPARE_DEV
>   btrfs: add check not to mount a spare device
>   btrfs: support btrfs dev scan for spare device
>   btrfs: provide framework to get and put a spare device
>   btrfs: introduce helper functions to perform hot replace
>   btrfs: check for failed device and hot replace
>
> Qu Wenruo (5):
>   btrfs: Introduce a new function to check if all chunks a OK for degraded mount
>   btrfs: Do per-chunk check for mount time check
>   btrfs: Do per-chunk degraded check for remount
>   btrfs: Allow barrier_all_devices to do per-chunk device check
>   btrfs: Cleanup num_tolerated_disk_barrier_failures
>
>  fs/btrfs/ctree.h       |   7 +-
>  fs/btrfs/dev-replace.c | 116 ++++++++++++++++++++
>  fs/btrfs/dev-replace.h |   1 +
>  fs/btrfs/disk-io.c     | 211 +++++++++++++++++++++++-------------
>  fs/btrfs/disk-io.h     |   2 -
>  fs/btrfs/super.c       |  20 +++-
>  fs/btrfs/transaction.c |   3 +-
>  fs/btrfs/volumes.c     | 283 ++++++++++++++++++++++++++++++++++++++++++++++---
>  fs/btrfs/volumes.h     |  27 +++++
>  9 files changed, 571 insertions(+), 99 deletions(-)

I've thrown everything I can think of at this over the weekend, and nothing broke (at least, nothing that had anything to do with these patches; I did end up triggering a couple of known bugs that I had completely forgotten about), so you can add:

Tested-by: Austin S. Hemmelgarn <ahferro...@gmail.com>
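
For anyone following the thread, here is a toy user-space model of the auto-replace flow described in the cover letter: a write or flush failure marks a device failed, and the next commit pass swaps in a device from the global spare pool, retrying on later commits if no spare is available yet. This is only a sketch and not the kernel code. Every name in it (Device, Volume, SparePool, report_io_error, commit_cycle) is hypothetical, and both the first-in-first-out spare order and the choice to ignore read errors are assumptions, since the cover letter leaves spare priority to future work and only mentions write and flush failures.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Device:
    """Hypothetical stand-in for a member or spare device."""
    path: str
    failed: bool = False


@dataclass
class Volume:
    """Hypothetical stand-in for one btrfs filesystem's device list."""
    devices: List[Device] = field(default_factory=list)


class SparePool:
    """Global spare pool, shared by every volume in the system."""

    def __init__(self) -> None:
        self._spares: List[Device] = []

    def add(self, path: str) -> None:
        self._spares.append(Device(path))

    def get(self) -> Optional[Device]:
        # Assumption: oldest spare first; the real pick order is undecided.
        return self._spares.pop(0) if self._spares else None


def report_io_error(dev: Device, op: str) -> None:
    """Mark the device failed on a write or flush error.

    Assumption: read errors are ignored in this model.
    """
    if op in ("write", "flush"):
        dev.failed = True


def commit_cycle(vol: Volume, pool: SparePool) -> List[Tuple[str, str]]:
    """One commit pass: replace each failed device with a spare, if any.

    Returns (failed_path, spare_path) pairs for the replacements made.
    A failed device stays in place when the pool is empty, so a later
    pass can retry once a spare has been added.
    """
    replaced: List[Tuple[str, str]] = []
    for i, dev in enumerate(vol.devices):
        if not dev.failed:
            continue
        spare = pool.get()
        if spare is None:
            break  # no spare yet; try again on the next commit
        vol.devices[i] = spare
        replaced.append((dev.path, spare.path))
    return replaced
```

With a volume of /dev/sdc and /dev/sdd and /dev/sde in the pool, calling report_io_error() on /dev/sdd with "flush" and then commit_cycle() once models the replacement from the example in the cover letter.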