On Mon, Jul 11, 2016 at 9:30 AM, Saso Kiselkov <skiselkov...@gmail.com>
wrote:

> On 7/9/16 11:24 PM, Matthew Ahrens wrote:
> > We had an intern work on "sorted scrub" last year.  Essentially the idea
> > was to read the metadata to gather into memory all the BP's that need to
> > be scrubbed, sort them by DVA (i.e. offset on disk) and then issue the
> > scrub i/os in that sorted order.  However, memory can't hold all of the
> > BP's, so we do multiple passes over the metadata, each pass gathering
> > the next chunk of BP's.  This code is implemented and seems to work but
> > probably needs some more testing and code cleanup.
> >
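Roughly, the shape of one of those in-memory passes is as follows. This is
only a toy C sketch: scrub_bp_t, issue_scrub_io(), and scrub_sorted_pass()
are made-up names for illustration, not the actual scan/scrub code, which
works on real blkptr_t/DVAs.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct scrub_bp {
        uint64_t sb_vdev;       /* top-level vdev id */
        uint64_t sb_offset;     /* DVA offset within that vdev */
        uint64_t sb_asize;      /* allocated size */
} scrub_bp_t;

/* Order by vdev, then by offset, so reads are issued in LBA order. */
static int
scrub_bp_compare(const void *a, const void *b)
{
        const scrub_bp_t *l = a, *r = b;

        if (l->sb_vdev != r->sb_vdev)
                return (l->sb_vdev < r->sb_vdev ? -1 : 1);
        if (l->sb_offset != r->sb_offset)
                return (l->sb_offset < r->sb_offset ? -1 : 1);
        return (0);
}

/* Stand-in for handing the read off to the scrub i/o pipeline. */
static void
issue_scrub_io(const scrub_bp_t *bp)
{
        printf("scrub vdev %llu offset %llu asize %llu\n",
            (unsigned long long)bp->sb_vdev,
            (unsigned long long)bp->sb_offset,
            (unsigned long long)bp->sb_asize);
}

/* One pass: sort everything gathered so far by DVA, then issue in order. */
static void
scrub_sorted_pass(scrub_bp_t *bps, size_t nbps)
{
        qsort(bps, nbps, sizeof (scrub_bp_t), scrub_bp_compare);
        for (size_t i = 0; i < nbps; i++)
                issue_scrub_io(&bps[i]);
}
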
> > One of the downsides of that approach is having to do multiple passes
> > over the metadata if it doesn't all fit in memory (which it typically
> > does not).  In some circumstances, this is worth it, but in others not
> > so much.  To improve on that, we would like to do just one pass over the
> > metadata to find all the block pointers.  Rather than storing the BP's
> > sorted in memory, we would store them on disk, but only roughly sorted.
> > There are several ways we could do the sorting, which is one of the
> > issues that makes this problem interesting.
> >
> > We could divide each top-level vdev into chunks (like metaslabs, but
> > probably a different number of them) and for each chunk have an on-disk
> > list of BP's in that chunk that need to be scrubbed/resilvered.  When we
> > find a BP, we would append it to the appropriate list.  Once we have
> > traversed all the metadata to find all the BP's, we would load one
> > chunk's list of BP's into memory, sort it, and then issue the resilver
> > i/os in sorted order.
> >
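A rough sketch of that bucketing, reusing scrub_bp_t and scrub_sorted_pass()
from the sketch above. The on-disk per-chunk lists are modeled here as
in-memory arrays purely to keep the sketch self-contained, all names are
made up, and error handling is omitted.

#define NCHUNKS 256

typedef struct bp_chunk {
        scrub_bp_t *bc_bps;     /* stands in for an on-disk append-only list */
        size_t bc_nbps;
        size_t bc_cap;
} bp_chunk_t;

static bp_chunk_t chunks[NCHUNKS];

/* Append phase: called for each BP found while traversing the metadata. */
static void
chunk_append(const scrub_bp_t *bp, uint64_t chunksize)
{
        uint64_t idx = bp->sb_offset / chunksize;
        bp_chunk_t *bc;

        if (idx >= NCHUNKS)
                idx = NCHUNKS - 1;
        bc = &chunks[idx];
        if (bc->bc_nbps == bc->bc_cap) {
                bc->bc_cap = bc->bc_cap ? bc->bc_cap * 2 : 64;
                bc->bc_bps = realloc(bc->bc_bps,
                    bc->bc_cap * sizeof (scrub_bp_t));
        }
        bc->bc_bps[bc->bc_nbps++] = *bp;
}

/* Issue phase: after traversal, sort and issue one chunk at a time. */
static void
chunk_issue_all(void)
{
        for (int c = 0; c < NCHUNKS; c++)
                scrub_sorted_pass(chunks[c].bc_bps, chunks[c].bc_nbps);
}
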
> > As an alternative, it might be better to accumulate as many BP's as fit
> > in memory, sort them, and then write that sorted list to disk.  Then
> > remove those BP's from memory and start filling memory again, write that
> > list, etc.  Then read all the sorted lists in parallel to do a merge
> > sort.  This has the advantage that we do not need to append to lots of
> > lists as we are traversing the metadata. Instead we have to read from
> > lots of lists as we do the scrubs, but this should be more efficient.  We
> > also don't have to determine beforehand how many chunks to divide each
> > vdev into.
> >
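And a sketch of the merge step for this alternative, reusing
scrub_bp_compare() and issue_scrub_io() from the first sketch. The sorted
"runs" are toy in-memory arrays standing in for the on-disk sorted lists,
and a real merge of many runs would likely use a heap rather than the
linear scan shown here.

typedef struct bp_run {
        const scrub_bp_t *br_bps;       /* one sorted run (on disk in reality) */
        size_t br_nbps;
        size_t br_next;                 /* next unconsumed entry */
} bp_run_t;

/* Merge the sorted runs, issuing i/os in overall DVA order. */
static void
merge_runs(bp_run_t *runs, int nruns)
{
        for (;;) {
                int best = -1;

                /* Find the run whose head BP sorts lowest. */
                for (int i = 0; i < nruns; i++) {
                        if (runs[i].br_next == runs[i].br_nbps)
                                continue;       /* run exhausted */
                        if (best == -1 || scrub_bp_compare(
                            &runs[i].br_bps[runs[i].br_next],
                            &runs[best].br_bps[runs[best].br_next]) < 0)
                                best = i;
                }
                if (best == -1)
                        return;                 /* all runs exhausted */
                issue_scrub_io(&runs[best].br_bps[runs[best].br_next++]);
        }
}
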
> > If you'd like to continue working on sorted scrub along these lines, let
> > me know.
>
> It seems multiple people converged on the same problem at the same time.
> I've implemented this at Nexenta, but it's not quite release-ready yet.
>

Very cool.  We should definitely collaborate.

> We avoid having to do multiple passes over the data by grouping pointers
> by proximity and sorting in a more complicated way, rather than just
> doing a naive elevator each pass. Depending on how badly randomized the
> layout is, this yields upwards of a 10x performance improvement (and in
> many cases approaches sequential throughput). We've not implemented
> storing the partially sorted list on disk, as that would be a far more
> complex undertaking than just
> keeping it in memory. That having been said, our memory overhead is
> essentially zero compared to on-disk metadata, i.e. we are able to sort
> and manipulate all sortable BPs in memory in essentially the same amount
> of space that the metadata takes up on disk. IOW, as long as your RAM is
> not too far off 1% of your disk storage's capacity, we can sort extremely
> well and achieve near-sequential throughput.
>

So you need to keep all the BP's in the pool in memory at once?  What
happens if they don't fit?  Does it fall back to the current non-sorted scrub?

We didn't think that was OK, which is why we allowed multiple passes over
the metadata. If you're using recordsize=8k (or volblocksize=8k), the
metadata is 16GB per 1TB of disk space.  Though I imagine compression or
fancy encoding of the BP could improve that somewhat.
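
(For reference, the 16GB figure is just 128-byte block pointers adding up:
1 TB / 8 KB = ~128M blocks, and 128M * 128 B = 16 GB of block pointers,
i.e. roughly 1.6% of the data.)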


>
> I don't know whether to continue this project at this point. I'd like to
> avoid having two competing implementations of essentially the same thing.
>

Let me see if I can find our existing code and share it with you all.

--matt


> 
> --
> Saso
> 


