On Thu, Feb 08, 2018 at 05:47:46PM +0800, Anand Jain wrote:
> 
> 
> On 02/06/2018 07:15 AM, Liu Bo wrote:
> > Btrfs tries its best to tolerate write errors, but does so kind of silently
> > (except for some messages in the kernel log).
> > 
> > For raid1 and raid10 this is usually not a problem because there is a
> > copy as backup, while for a parity based raid setup, i.e. raid5 and
> > raid6, the problem is that if a write error occurs due to some bad
> > sectors, one horizontal stripe becomes degraded and the number of write
> > errors it can tolerate is reduced by one; now if two disks fail,
> > data may be lost forever.
> 
> This is equally true for raid1, raid10, and raid5.  Sorry, I didn't get
> why a degraded stripe is critical only for the parity based
> profiles (raid5/6)?

Hmm, right, it also applies to raid1 and raid10.

> And does it really need a bad chunk list to fix the parity based
> stripes, or can a balance without the bad chunk list fix it as well?
>

A full balance can surely fix it, but the intent here is to move only
the chunk that the bad stripe belongs to, so that an expensive full
balance can be avoided.
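For illustration, something like the following rough sketch (not part of the
patch) could relocate just that chunk with the existing vrange balance
filter; the bad_chunks output format and the 1 GiB chunk length used here
are assumptions:

#!/usr/bin/env python3
# Sketch: relocate only the chunk that contains the bad stripe, instead of
# running a full balance.  The chunk logical start is assumed to come from
# the proposed bad_chunks file; the 1 GiB chunk length is an assumption.
import subprocess
import sys

def relocate_chunk(mnt, chunk_start, chunk_len=1 << 30):
    # vrange=<start>..<end> is an existing balance filter: it restricts the
    # balance to block groups overlapping the given logical address range.
    vrange = "{}..{}".format(chunk_start, chunk_start + chunk_len)
    subprocess.run(["btrfs", "balance", "start",
                    "-dvrange=" + vrange, "-mvrange=" + vrange, mnt],
                   check=True)

if __name__ == "__main__":
    # usage: relocate_chunk.py <mountpoint> <chunk logical start in bytes>
    relocate_chunk(sys.argv[1], int(sys.argv[2], 0))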

> > One way to mitigate the data loss pain is to expose 'bad chunks',
> > i.e. degraded chunks, to users, so that they can use 'btrfs balance'
> > to relocate the whole chunk and get the full raid6 protection again
> > (if the relocation works).
> 
> Depending on the type of disk error, the recovery action will vary. For
> example, it can be a complete disk failure or an interim RW failure due to
> environmental/transport factors. Automatic relocation in most modern
> disks will take care of remapping the real bad blocks.
> The challenging task is knowing where to draw the line between a
> complete disk failure (failed) and an interim disk failure (offline), so I
> had plans to make it tunable based on the number of disk errors.
>

Right, I agree with the configurable stuff.

> If it's confirmed that a disk has failed, auto-replace with the hot
> spare disk will be its recovery action. Balance with a failed disk won't
> help.
>
> Patches for these are on the ML.
> 
> If the failure is momentary due to environmental factors, including the
> transport layer, then, as we expect the disk with the data to come back,
> we shouldn't kick in the hot spare; that is the 'offline' disk state, or
> maybe a state where reading old data is fine but new data cannot be written.
> I think you are addressing this interim state. It's better to define the
> disk states first so that their recovery actions can be defined. I can
> revise the patches on that, so that replace vs. re-balance using bad chunks
> can be decided.

That's true: with the bad chunk info, one can decide for oneself whether to
balance or replace.  If the whole device has failed, balance doesn't make
sense and will surely fail.
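To illustrate that decision, a script could combine the proposed bad chunk
list with the per-device error counters from 'btrfs device stats': if one
device keeps accumulating write errors, replace it; otherwise balance the
affected chunks.  A rough sketch, where the threshold is purely an
arbitrary assumption:

#!/usr/bin/env python3
# Sketch: suggest 'replace' vs. 'balance' from per-device write error
# counters.  'btrfs device stats' exists today; the threshold is an
# arbitrary assumption, not something defined by btrfs.
import re
import subprocess

WRITE_ERR_THRESHOLD = 100  # assumption: above this, treat the device as failing

def write_errors_per_device(mnt):
    out = subprocess.run(["btrfs", "device", "stats", mnt],
                         capture_output=True, text=True, check=True).stdout
    errs = {}
    for line in out.splitlines():
        # lines look like: "[/dev/sdb].write_io_errs   42"
        m = re.match(r"\[(.+)\]\.write_io_errs\s+(\d+)", line)
        if m:
            errs[m.group(1)] = int(m.group(2))
    return errs

def suggest_action(mnt):
    for dev, n in write_errors_per_device(mnt).items():
        if n > WRITE_ERR_THRESHOLD:
            return "consider 'btrfs replace' for " + dev
    return "balance the chunks listed in bad_chunks"

if __name__ == "__main__":
    print(suggest_action("/mnt"))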

> 
> > This introduces 'bad_chunks' in btrfs's per-fs sysfs directory.  Once
> > a chunk of raid5 or raid6 becomes degraded, it will appear in
> > 'bad_chunks'.
> 
> AFAIK a variable-length list as output is not allowed on sysfs.
>
> IMHO a list of bad chunks won't help the user (it's OK if it's needed by
> the kernel). It would help if you provided the list of affected files,
> so that the user can use a script to make an additional interim external
> copy until the disk recovers from the interim error.
>

Probably we can do that, but given the complexity of walking the btree
several times to find the name/path/other stuff, chunk granularity is
good enough for the use case I have: users here do not want to interfere
too much with the low-level filesystem via a script, and probably the
only feasible thing for them is to click some buttons in a GUI.
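That said, for anyone who does want file-level information, a script can
already approximate it from a chunk's logical range with the existing
'btrfs inspect-internal logical-resolve' command.  The sketch below samples
offsets inside the (assumed 1 GiB) chunk, which is exactly the kind of
extra tree walking mentioned above:

#!/usr/bin/env python3
# Sketch: list file paths touched by a degraded chunk by sampling logical
# offsets inside it.  'logical-resolve' exists in btrfs-progs; the 16 MiB
# sampling step and the 1 GiB chunk length are assumptions.
import subprocess

def files_in_chunk(mnt, chunk_start, chunk_len=1 << 30, step=16 << 20):
    paths = set()
    for logical in range(chunk_start, chunk_start + chunk_len, step):
        res = subprocess.run(["btrfs", "inspect-internal", "logical-resolve",
                              str(logical), mnt],
                             capture_output=True, text=True)
        if res.returncode == 0:  # non-zero exit means nothing is mapped there
            paths.update(p for p in res.stdout.splitlines() if p)
    return sorted(paths)

if __name__ == "__main__":
    # 0x100000000 is only a placeholder chunk logical start
    for p in files_in_chunk("/mnt", 0x100000000):
        print(p)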

Thanks a lot for your inputs.

Thanks,

-liubo