On 7/9/16 11:24 PM, Matthew Ahrens wrote:
> We had an intern work on "sorted scrub" last year.  Essentially the idea
> was to read the metadata to gather into memory all the BP's that need to
> be scrubbed, sort them by DVA (i.e. offset on disk) and then issue the
> scrub i/os in that sorted order.  However, memory can't hold all of the
> BP's, so we do multiple passes over the metadata, each pass gathering
> the next chunk of BP's.  This code is implemented and seems to work but
> probably needs some more testing and code cleanup.
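> 
> A rough sketch of the core idea in C (the bp_list_* and
> issue_scrub_io names are made up for illustration; blkptr_t,
> blk_dva and DVA_GET_OFFSET are the real ZFS ones):
> 
>     /* Gather phase: callback invoked for each block pointer (BP)
>      * found while traversing the pool's metadata. */
>     static int
>     gather_cb(const blkptr_t *bp, void *arg)
>     {
>             bp_list_t *list = arg;
>             /* Key each BP by the disk offset of its first DVA. */
>             bp_list_append(list, bp, DVA_GET_OFFSET(&bp->blk_dva[0]));
>             return (0);
>     }
> 
>     /* Issue phase: sort by that offset, then scrub in sorted order
>      * so the disk sees mostly-sequential reads. */
>     const blkptr_t *bp;
>     bp_list_sort(list);
>     for (bp = bp_list_first(list); bp != NULL;
>         bp = bp_list_next(list, bp))
>             issue_scrub_io(bp);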
> 
> One of the downsides of that approach is having to do multiple passes
> over the metadata if it doesn't all fit in memory (which it typically
> does not).  In some circumstances, this is worth it, but in others not
> so much.  To improve on that, we would like to do just one pass over the
> metadata to find all the block pointers.  Rather than storing the BP's
> sorted in memory, we would store them on disk, but only roughly sorted. 
> There are several ways we could do the sorting, which is one of the
> issues that makes this problem interesting.
> 
> We could divide each top-level vdev into chunks (like metaslabs, but
> probably a different number of them) and for each chunk have an on-disk
> list of BP's in that chunk that need to be scrubbed/resilvered.  When we
> find a BP, we would append it to the appropriate list.  Once we have
> traversed all the metadata to find all the BP's, we would load one
> chunk's list of BP's into memory, sort it, and then issue the resilver
> i/os in sorted order.
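> 
> The append path might look something like this (all names below are
> invented except vdev_asize, which is the real vdev field):
> 
>     /* Map a BP's disk offset to one of nchunks equally-sized
>      * chunk lists kept per top-level vdev. */
>     static uint64_t
>     chunk_index(vdev_t *vd, uint64_t offset, uint64_t nchunks)
>     {
>             return (offset / (vd->vdev_asize / nchunks));
>     }
> 
>     /* During traversal: append the BP to the on-disk list for the
>      * chunk that its first DVA falls into. */
>     uint64_t off = DVA_GET_OFFSET(&bp->blk_dva[0]);
>     ondisk_list_append(&vd->chunk_lists[chunk_index(vd, off, nchunks)],
>         bp);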
> 
> As an alternative, it might be better to accumulate as many BP's as fit
> in memory, sort them, and then write that sorted list to disk.  Then
> remove those BP's from memory and start filling memory again, write that
> list, etc.  Then read all the sorted lists in parallel to do a merge
> sort.  This has the advantage that we do not need to append to lots of
> lists as we are traversing the metadata. Instead we have to read from
> lots of lists as we do the scrubs, but this should be more efficient  We
> also don't have to determine beforehand how many chunks to divide each
> vdev into.
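> 
> The merge phase would then be the classic k-way merge over the
> sorted runs, e.g. (invented names again):
> 
>     /* Min-heap of runs, ordered by the disk offset of each run's
>      * current head BP. */
>     heap_t h;
>     heap_init(&h, run_head_offset_cmp);
>     for (int i = 0; i < nruns; i++)
>             heap_insert(&h, run_open(i));   /* reads each run's first BP */
> 
>     /* Repeatedly issue the globally-lowest BP, then refill the
>      * heap from the run it came from. */
>     while (!heap_empty(&h)) {
>             run_t *r = heap_remove_min(&h);
>             issue_scrub_io(run_head_bp(r));
>             if (run_advance(r))             /* more BPs left in this run? */
>                     heap_insert(&h, r);
>     }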
> 
> If you'd like to continue working on sorted scrub along these lines, let
> me know.

It seems multiple people converged on the same problem at the same time.
I've implemented this at Nexenta, but it's not quite release-ready yet.
We avoid having to do multiple passes over the data by grouping
pointers by proximity and sorting them in a more sophisticated way,
rather than just doing a naive elevator scan on each pass. Depending
on how badly randomized the on-disk layout is, this yields upwards of
a 10x performance improvement (and in many cases approaches sequential
throughput). We've not implemented storing the partially-sorted list
on disk, as that would be a far more complex undertaking than just
keeping it in memory. That having been said, our memory overhead is
essentially zero relative to the on-disk metadata, i.e. we are able to
sort and manipulate all sortable BPs in memory in essentially the same
amount of space that the metadata takes up on disk. IOW, as long as
your RAM quantity is not too far off 1% of your disk storage's
capacity, we can sort extremely well and achieve near-sequential
throughput.
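
For a rough sense of where that ratio comes from (back-of-the-envelope,
not a measurement): a blkptr_t is 128 bytes, so at an average block
size of 16K the BPs alone come to

    128 B per BP / 16384 B per block ~= 0.8%

of the data they describe, which lines up with the ~1% figure above;
larger average block sizes need proportionally less RAM.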

I don't know whether to continue this project at this point. I'd like
to avoid having two competing implementations of essentially the same
thing.

-- 
Saso

