On 02/09/2018 03:07, Liu Bo wrote:
> On Thu, Feb 08, 2018 at 07:52:05PM +0100, Goffredo Baroncelli wrote:
>> On 02/06/2018 12:15 AM, Liu Bo wrote:
>> [...]
>>> One way to mitigate the data loss pain is to expose 'bad chunks',
>>> i.e. degraded chunks, to users, so that they can use 'btrfs balance'
>>> to relocate the whole chunk and get the full raid6 protection again
>>> (if the relocation works).
>>
>> [...]
>> [...]
>>
>>> +
>>> +   /* read lock please */
>>> +   do {
>>> +           seq = read_seqbegin(&fs_info->bc_lock);
>>> +           list_for_each_entry(bc, &fs_info->bad_chunks, list) {
>>> +                   len += snprintf(buf + len, PAGE_SIZE - len, "%llu\n",
>>> +                                   bc->chunk_offset);
>>> +                   /* chunk offset is u64 */
>>> +                   if (len >= PAGE_SIZE)
>>> +                           break;
>>> +           }
>>> +   } while (read_seqretry(&fs_info->bc_lock, seq));
>>
>> Using this interface, how many chunks can you list? If I read the code
>> correctly, only enough to fill a single kernel page.
>>
>> If my math is correct (PAGE_SIZE=4k, and a u64 can take up to 21 bytes:
>> 20 digits plus a newline), it is possible to list only a few hundred
>> chunks (~200), not more; and the last one could even be listed truncated.
>>
> 
> That's true.
> 
>> IIRC the max chunk size is 1GB; if you lost a 500GB disk, there could be
>> more than 200 chunks to list.
>>
>> My first suggestion is to limit the number of chunks shown to 200 (a page
>> should be big enough to contain that many chunk offsets). If the number of
>> chunks is greater, end the list with a marker (something like '[...]\n').
>> This would remove the ambiguity about whether the listed chunks are
>> complete or not; in any case you cannot list all the chunks.
>>
> 
> Good point, and I need to add one more field to each record to specify
> whether it's metadata or data.
> 
> So what I have in mind is that this kind of error is rare and reflects
> bad sectors on disk, but if there really are that many errors, I think
> we need to replace the disk without hesitation.
> 
>> However, my second suggestion is to... completely change the interface.
>> What about adding a directory in sysfs, where each entry is a chunk?
>>
>> Something like:
>>
>> /sys/fs/btrfs/<FS-UUID>/chunks/<chunk-offset>/type     # data/metadata/sys
>> /sys/fs/btrfs/<FS-UUID>/chunks/<chunk-offset>/profile  # dup/linear...
>> /sys/fs/btrfs/<FS-UUID>/chunks/<chunk-offset>/size     # size
>> /sys/fs/btrfs/<FS-UUID>/chunks/<chunk-offset>/devs     # chunk's devs

What about a netlink interface?

Although it may need an extra daemon to listen to it, and some people
won't be happy about adding yet another daemon.

Thanks,
Qu

>>
>> And so on. 
>>
>> Checking "[...]<chunk-offset>/devs", it would be easy to see whether the
>> chunk is in "degraded" mode or not.
> 
> I'm afraid we cannot do that, as it would occupy too much memory on large
> filesystems, given that a typical chunk size is 1GB.
> 
>>
>> However, I have to admit that I don't know how feasible it is to iterate
>> over a sysfs directory which maps a list of kernel objects.
>>
>> I think that if this interface were good enough, we could get rid of a
>> lot of the ioctl(TREE_SEARCH) calls in btrfs-progs.
>>
> 
> TREE_SEARCH is faster than iterating over sysfs (if we could), isn't it?
> 
> thanks,
> -liubo
