On 2015-11-12 03:15, Qu Wenruo wrote:
> Hi Anand,
> 
> Nice work.
> But I have some small questions about it.
> 
> Anand Jain wrote on 2015/11/09 18:56 +0800:
>> This set of patches provides btrfs hot spare and auto replace support
>> for your review and comments.
>>
>> First, here below are the simple example steps to configure the same:
>>
>> Add a spare device:
>>      btrfs spare add /dev/sde -f
> 
> I'm sorry but I didn't quite see the benefit of a spare device.
> 
> Let's take the following example:
> 
> 1) 2 RAID1 + 1 spare
>    (A + B) + C
> 
> 2) 3 RAID1
>    (A + B + C)
> Let's assume they are all 12G size, and there are 3 raid1 chunks.
> Each one is 3G size.
> 
> In my understanding, in normal operation case:
> 
> For case 1), all raid1 chunks should be allocated only on the 2 RAID disks,
> and the spare one should contain no raid1 chunks.
> 
>   A       B       C
> ------  ------  ------
> |free|  |free|  |free|
> ------  ------  |    |
> |3Ga1|  |3Ga2|  |    |
> ------  ------  |    |
> |3Gb1|  |3Gb2|  |    |
> ------  ------  |    |
> |3Gc1|  |3Gc2|  |    |
> ------  ------  ------
> 
> 
> For case 2), all raid1 chunks will be allocated into all 3 disks, making the 
> allocation more fair.
>   A       B       C
> ------  ------  ------
> |free|  |free|  |free|
> ------  ------  ------
> |free|  |free|  |free|
> ------  ------  ------
> |3Gb2|  |3Ga1|  |3Ga2|
> ------  ------  ------
> |3Gc1|  |3Gc2|  |3Gb1|
> ------  ------  ------
> 
> 
> At least in the normal operation case, case 1) makes device C useless, and 
> reduces the total usable space.
> 
> In disk B failure case:
> 
> For case 1), we can auto replace B with C.
> And it will copy all data chunks from A to C.
> Need to copy 9G data.
> 
> And after replace:
>   A       B       C
> ------  ------  ------
> |free|  | X  |  |free|
> ------  ------  ------
> |3Ga1|  | X  |->|3Ga2|
> ------  ------  ------
> |3Gb1|  | X  |->|3Gb2|
> ------  ------  ------
> |3Gc1|  | X  |->|3Gc2|
> ------  ------  ------
> 
> 
> 
> For case 2), we can just relocate and recover the bad chunks in B.
> It should only need to copy 6G data.
> 
> And after the "recovery", it should be much the same as case 1):
>   A       B       C
> ------  ------  ------
> |free|  | X  |  |free|
> ------  ------  ------
> |3Ga1|<\| X  |/>|3Gc2|
> ------  ------  ------
> |3Gb2| || X  |/ |3Ga2|
> ------  ------  ------
> |3Gc1| \| X  |  |3Gb1|
> ------  ------  ------
> 
> 
> IIRC, the only benefit of a spare device is that we can ensure there is enough 
> space for a device replace (if the failing one is no larger than the spare).
> 
> But the cost is more data to copy during replace, and unfair chunk allocation.
> 
> So I am not sure the benefit is worth the cost.
> At least, enhancing the chunk relocation to handle case 2) would need a 
> much smaller code base.
> 
> Thanks,
> Qu

Interesting analysis. Another difference between the two scenarios is that in 
the first case (A+B+spare) the spare doesn't work until it is needed: 
less power consumption, and when needed you are using a new disk instead of a 
used one.
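
As a sanity check on the numbers above, here is a toy Python model of the
two layouts. It is only a sketch: the chunk names follow Qu's diagrams,
while gib_to_copy() and raid1_usable_g() are made-up helper names (not
btrfs code), and it assumes equally sized devices.

```python
# Toy model of the two layouts from Qu's example; nothing here is btrfs code.
CHUNK_G = 3  # each raid1 chunk copy is 3G, as in the example

# device -> chunk copies stored on it ("a1"/"a2" are the two copies of chunk a)
CASE1 = {"A": ["a1", "b1", "c1"], "B": ["a2", "b2", "c2"], "C": []}  # C idles as spare
CASE2 = {"A": ["b2", "c1"], "B": ["a1", "c2"], "C": ["a2", "b1"]}


def gib_to_copy(layout, failed):
    """Data (in G) that must be rewritten elsewhere once `failed` is lost."""
    return len(layout[failed]) * CHUNK_G


def raid1_usable_g(device_g, data_devices):
    """Usable RAID1 space for equally sized devices: half of the raw space."""
    return device_g * data_devices // 2


copy_case1 = gib_to_copy(CASE1, "B")
copy_case2 = gib_to_copy(CASE2, "B")
print(copy_case1)  # 9
print(copy_case2)  # 6
print(raid1_usable_g(12, 2), raid1_usable_g(12, 3))  # 12 18
```

So in this model the spare setup copies 9G instead of 6G on a failure of B,
and offers 12G of usable space instead of 18G during normal operation.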

>>
>> OR if there is a spare device which is already added before the, just
>> run
>>
>>      btrfs dev scan [/dev/sde]
>>
>> this will register the spare device to the kernel.
>>
>>      btrfs fi show
>>      Label: none  uuid: 52f170c1-725c-457d-8cfd-d57090460091
>>     Total devices 2 FS bytes used 112.00KiB
>>     devid    1 size 2.00GiB used 417.50MiB path /dev/sdc
>>     devid    2 size 2.00GiB used 417.50MiB path /dev/sdd
>>
>>      Global spare
>>     device size 3.00GiB path /dev/sde
>>
>> That's it.
>>
>> Auto replace:
>>   Replace happens automatically: when a write or flush fails, the
>>   device is marked as failed, which stops any further IO attempt to
>>   that device. In the next commit thread cycle, auto replace picks the
>>   spare device (/dev/sde in the above example) to replace the failed
>>   device, and so the btrfs volume is back to a healthy state.
>>
>>
>> It's a btrfs global spare:
>>   As of now only a global hot spare is supported, that is, hot spare(s)
>>   are for all the btrfs filesystems in the system.
>>
>> No spare when device failed:
>>   It would scan for a spare device at the rate of transaction commit
>>   and will trigger the auto replace whenever a spare device is added.
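
The failure and replace flow described above could be modelled roughly as
below. This is a hypothetical user-space Python sketch, not the kernel code:
Device, commit_cycle() and the rule that a spare must be at least as large as
the failed device are assumptions for illustration (the size rule is taken
from Qu's remark; the patch set may choose differently).

```python
from dataclasses import dataclass


@dataclass
class Device:
    path: str
    size_gib: float
    failed: bool = False  # set once a write or flush to the device fails


def commit_cycle(volume, spares):
    """One simulated transaction commit: swap each failed device for a spare.

    Returns the (failed_path, spare_path) pairs replaced in this cycle.
    A failed device with no suitable spare is left alone and retried on
    the next cycle.
    """
    replaced = []
    for i, dev in enumerate(volume):
        if not dev.failed:
            continue
        # Assumption: the first spare that is large enough wins.
        spare = next((s for s in spares if s.size_gib >= dev.size_gib), None)
        if spare is None:
            continue
        spares.remove(spare)   # the spare leaves the global pool
        volume[i] = spare      # and takes the failed device's place
        replaced.append((dev.path, spare.path))
    return replaced


# Devices and sizes as in the "btrfs fi show" output above.
volume = [Device("/dev/sdc", 2.0), Device("/dev/sdd", 2.0)]
spares = []

volume[1].failed = True                   # a write to /dev/sdd failed
first = commit_cycle(volume, spares)      # no spare yet, nothing happens
spares.append(Device("/dev/sde", 3.0))    # a spare is added later
second = commit_cycle(volume, spares)     # the next commit picks it up

print(first)   # []
print(second)  # [('/dev/sdd', '/dev/sde')]
```
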
>>
>> Priority:
>>   In some future work there can be some chronological order to pick
>>   a spare and the failed device.
>>
>>
>> Patches:
>>
>> Kernel:
>> First, it needs, Qu's per chunk missing device patchset,
>> which is part of the set here and also there is a light optimization
>> (patch 5/15) which was required as part of this enhancement.
>>
>> Next, patches 7,8/15 bring in support to manage the transition of
>> devices from online (no state) to offline OR failed state dynamically,
>> on top of static device states like the current "missing" state.
>>
>> Patch 9/15 fixes a bug wherein we should have blocked the incompatible
>> feature at the device scan/add level instead of (or as well as) at the
>> mount level. This is because we don't have to bring a device into the
>> device list if it is incompatible.
>>
>> Next, patches 10,11,12,13/15 add support for the spare device. For the
>> details on how to add a spare device kindly see further below.
>> A kernel without the spare feature keeps away from the spare device,
>> and a kernel that supports the spare device will refuse to mount it.
>> Further, this patch set provides helper functions to pick a spare
>> device and release a spare device back to the spare device pool.
>>
>> Patch 14/15 provides functions for auto replace; this is mainly
>> from the existing replace code, and in the long run I see an opportunity
>> to merge this code with the replace code that is triggered from
>> user space.
>>
>> Last, 15/15 uses all these facilities, picks a failed device and
>> triggers an auto replace in a kthread (casualty_kthread()).
>>
>>
>> Progs:
>> Would need 4 patches as listed below.
>>
>>
>> Known Bug:
>>
>> As of now I see the stale kmem cache below during module unload, which
>> I am digging into.
>> ------
>> BUG btrfs_path (Not tainted): Objects remaining in btrfs_path on kmem_cache_close()
>> ------
>>
>> Anand Jain (10):
>>    btrfs: optimize btrfs_check_degradable() for calls outside of barrier
>>    btrfs: introduce device dynamic state transition to offline or failed
>>    btrfs: check device for critical errors and mark failed
>>    btrfs: block incompatible optional features at scan
>>    btrfs: introduce BTRFS_FEATURE_INCOMPAT_SPARE_DEV
>>    btrfs: add check not to mount a spare device
>>    btrfs: support btrfs dev scan for spare device
>>    btrfs: provide framework to get and put a spare device
>>    btrfs: introduce helper functions to perform hot replace
>>    btrfs: check for failed device and hot replace
>>
>> Qu Wenruo (5):
>>    btrfs: Introduce a new function to check if all chunks a OK for
>>      degraded mount
>>    btrfs: Do per-chunk check for mount time check
>>    btrfs: Do per-chunk degraded check for remount
>>    btrfs: Allow barrier_all_devices to do per-chunk device check
>>    btrfs: Cleanup num_tolerated_disk_barrier_failures
>>
>>   fs/btrfs/ctree.h       |   7 +-
>>   fs/btrfs/dev-replace.c | 116 ++++++++++++++++++++
>>   fs/btrfs/dev-replace.h |   1 +
>>   fs/btrfs/disk-io.c     | 211 +++++++++++++++++++++++-------------
>>   fs/btrfs/disk-io.h     |   2 -
>>   fs/btrfs/super.c       |  20 +++-
>>   fs/btrfs/transaction.c |   3 +-
>>   fs/btrfs/volumes.c     | 283 ++++++++++++++++++++++++++++++++++++++++++++++---
>>   fs/btrfs/volumes.h     |  27 +++++
>>   9 files changed, 571 insertions(+), 99 deletions(-)
>>
> -- 
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to [email protected]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5