On Mon, Jul 11, 2016 at 9:54 AM, Saso Kiselkov <skiselkov...@gmail.com> wrote:
> On 7/11/16 6:37 PM, Matthew Ahrens wrote:
> >
> > Very cool. We should definitely collaborate.
>
> Agree!
>
> > > We avoid having to do multiple passes over the data by grouping
> > > pointers by proximity and sorting in a more complicated way, rather
> > > than just doing a naive elevator each pass. This results in
> > > performance that, depending on how badly randomized the layout is,
> > > yields upwards of a 10x performance improvement (and in many cases
> > > approaches sequential throughput). We've not implemented the storing
> > > of the partial-sorted list on disk, as that would be a far more
> > > complex undertaking than just keeping it in memory. That having been
> > > said, our memory overhead is essentially zero compared to on-disk
> > > metadata, i.e. we are able to sort and manipulate all sortable BPs
> > > in memory in essentially the same amount of space that the metadata
> > > takes up on disk. IOW, as long as your RAM quantity is not too far
> > > off being 1% of your disk storage's capacity, we can sort extremely
> > > well and achieve near sequential throughput.
> >
> > So you need to keep all the BPs in the pool in memory at once? What
> > happens if they don't fit? Fall back to the current non-sorted scrub?
>
> So the way it works is that we have essentially split up scrub into two
> cooperating threads, the "scanner" and the "reader". The scanner pushes
> BPs into a per-device sorting queue, which consists of two AVL trees
> and a range tree. In the range tree, BPs are joined up into extents and
> we allow for a certain inter-BP gap (i.e. if the offsets are close
> enough, we consider it worthwhile to read them at once). We also track
> the "fill" of an extent, i.e. how much of the extent is comprised of
> actual readable data and how much is inter-BP gaps. While scanning is
> going on, we continuously sort these extents such that we always keep
> the "juiciest" (largest and most contiguous) ones at the front.
>
> If we hit a configurable memory cap, we have the scanner pause for a
> bit and engage the reader so that it starts consuming the queue,
> issuing all the reads. Previously we did this in parallel, but I found
> that doing only one at a time helps overall throughput (because
> metadata reads are random, whereas sorted extent reading is sequential,
> so it helps not to interrupt it).

Ah, that's very neat. Seems like a nice tradeoff between not using too
much memory, and still getting a substantial (though not completely
optimal) performance win!

> > We didn't think that was OK, which is why we allowed multiple passes
> > over the metadata. If you're using recordsize=8k (or volblocksize=8k),
> > the metadata is 16GB per 1TB of disk space. Though I imagine
> > compression or fancy encoding of the BP could improve that somewhat.
>
> For simplicity's sake, I've implemented the sorting such that we only
> do it for level=0 blocks and only for blocks with copies=1. Removing
> either of these restrictions seemed like a rather complex addition to
> the algorithm for relatively little payoff. That is not to say that it
> wouldn't work, but basically I was trying to keep this to a relatively
> simple modification.
>
> > I don't know whether to continue this project at this point. I'd like
> > to avoid having two competing implementations of essentially the same
> > thing.
> >
> > Let me see if I can find our existing code and share it with you all.
>
> I'll try to throw our code up somewhere as well.
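To make the extent-joining idea above a bit more concrete, here is a rough,
self-contained C sketch. It is not the actual implementation (which, per the
description above, keeps per-device AVL trees and a range tree); the names
(scan_extent_t, sort_queue_t, sq_add_bp(), SQ_MAX_GAP, etc.) are made up for
illustration, and a flat array stands in for the real trees. It only shows
the mechanics being described: BPs whose offsets fall within an allowed gap
are merged into one extent, each extent tracks how much of its span is real
data ("fill"), and the reader drains the queue best-filled-first.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define SQ_MAX_EXTENTS  1024
    #define SQ_MAX_GAP      (64 * 1024)  /* hypothetical allowed inter-BP gap */

    /* One extent of nearby block pointers on a single device. */
    typedef struct scan_extent {
            uint64_t se_start;  /* offset of the first byte covered */
            uint64_t se_end;    /* offset just past the last byte covered */
            uint64_t se_fill;   /* bytes of actual BP data inside the span */
    } scan_extent_t;

    /* Stand-in for the per-device sorting queue (really AVL + range trees). */
    typedef struct sort_queue {
            scan_extent_t sq_ext[SQ_MAX_EXTENTS];
            int sq_count;
    } sort_queue_t;

    /* Rank extents by fill so the "juiciest" ones come out first. */
    static int
    se_compare(const void *a, const void *b)
    {
            const scan_extent_t *x = a, *y = b;

            if (x->se_fill == y->se_fill)
                    return (0);
            return (x->se_fill > y->se_fill ? -1 : 1);  /* descending */
    }

    /* Scanner side: record one BP, merging it into a nearby extent if any. */
    static void
    sq_add_bp(sort_queue_t *sq, uint64_t offset, uint64_t size)
    {
            for (int i = 0; i < sq->sq_count; i++) {
                    scan_extent_t *se = &sq->sq_ext[i];

                    /* Close enough (within SQ_MAX_GAP)? Then extend this extent. */
                    if (offset <= se->se_end + SQ_MAX_GAP &&
                        offset + size + SQ_MAX_GAP >= se->se_start) {
                            if (offset < se->se_start)
                                    se->se_start = offset;
                            if (offset + size > se->se_end)
                                    se->se_end = offset + size;
                            se->se_fill += size;
                            return;
                    }
            }
            if (sq->sq_count < SQ_MAX_EXTENTS) {  /* toy stand-in for the memory cap */
                    scan_extent_t *se = &sq->sq_ext[sq->sq_count++];

                    se->se_start = offset;
                    se->se_end = offset + size;
                    se->se_fill = size;
            }
    }

    /* Reader side: drain the queue, issuing the best-filled extents first. */
    static void
    sq_issue_reads(sort_queue_t *sq)
    {
            qsort(sq->sq_ext, sq->sq_count, sizeof (scan_extent_t), se_compare);
            for (int i = 0; i < sq->sq_count; i++) {
                    scan_extent_t *se = &sq->sq_ext[i];

                    printf("read [%llu, %llu) fill %llu bytes\n",
                        (unsigned long long)se->se_start,
                        (unsigned long long)se->se_end,
                        (unsigned long long)se->se_fill);
            }
            sq->sq_count = 0;
    }

    int
    main(void)
    {
            sort_queue_t sq = { .sq_count = 0 };

            /* BPs arrive in (random) metadata order ... */
            sq_add_bp(&sq, 512ULL << 20, 8192);            /* far away: own extent */
            sq_add_bp(&sq, 1ULL << 20, 8192);
            sq_add_bp(&sq, (1ULL << 20) + 8192, 8192);     /* merges with previous */

            /* ... and are read back as large, mostly sequential extents. */
            sq_issue_reads(&sq);
            return (0);
    }

In the scheme described above, the scanner would keep feeding something like
sq_add_bp() until the configurable memory cap is hit, then pause while the
reader runs the equivalent of sq_issue_reads(); as noted, the two alternate
rather than run concurrently, so the sequential extent reads are not
interrupted by random metadata I/O.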
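For reference, the 16GB-per-1TB figure quoted above falls out of the size of
an on-disk block pointer (a blkptr_t is 128 bytes):

    1 TiB / 8 KiB recordsize = 2^27 (~134 million) level-0 blocks
    2^27 blocks * 128 bytes per blkptr = 2^34 bytes = 16 GiB

so the metadata footprint, and with it the memory needed to hold every
sortable BP at once, scales inversely with block size; at a 128k recordsize
the same math gives roughly 1GB per 1TB.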
>
> Cheers,
> --
> Saso