On Mon, Jul 11, 2016 at 9:54 AM, Saso Kiselkov <skiselkov...@gmail.com> wrote:
> On 7/11/16 6:37 PM, Matthew Ahrens wrote:
> >
> > Very cool. We should definitely collaborate.
>
> Agree!
>
> > > We avoid having to do multiple passes over the data by grouping
> > > pointers by proximity and sorting in a more complicated way, rather
> > > than just doing a naive elevator each pass. This results in
> > > performance that, depending on how badly randomized the layout is,
> > > yields upwards of a 10x performance improvement (and in many cases
> > > approaches sequential throughput). We've not implemented the storing
> > > of the partial-sorted list on disk, as that would be a far more
> > > complex undertaking than just keeping it in memory. That having been
> > > said, our memory overhead is essentially zero compared to on-disk
> > > metadata, i.e. we are able to sort and manipulate all sortable BPs
> > > in memory in essentially the same amount of space that the metadata
> > > takes up on disk. IOW, as long as your RAM quantity is not too far
> > > off being 1% of your disk storage's capacity, we can sort extremely
> > > well and achieve near sequential throughput.
> >
> > So you need to keep all the BPs in the pool in memory at once? What
> > happens if they don't fit? Fall back to the current non-sorted scrub?
>
> So the way it works is that we have essentially split up scrub into two
> cooperating threads, the "scanner" and the "reader". The scanner pushes
> BPs into a per-device sorting queue, which consists of two AVL trees
> and a range tree. In the range tree, BPs are joined up into extents and
> we allow for a certain inter-BP gap (i.e. if the offsets are close
> enough, we consider it worthwhile to read them at once). We also track
> the "fill" of an extent, i.e. how much of the extent is comprised of
> actual readable data and how much is inter-BP gaps. While scanning is
> going on, we continuously sort these extents such that we always keep
> the "juiciest" (largest and most contiguous) ones at the front.
>
> If we hit a configurable memory cap, we have the scanner pause for a
> bit and engage the reader so that it starts consuming the queue,
> issuing all the reads. Previously we did this in parallel, but I found
> that doing only one at a time helps overall throughput (because
> metadata reads are random, whereas sorted extent reading is sequential,
> so it helps not to interrupt it).

Ah, that's very neat. Seems like a nice tradeoff between not using too
much memory, and still getting a substantial (though not completely
optimal) performance win!

> > We didn't think that was OK, which is why we allowed multiple passes
> > over the metadata. If you're using recordsize=8k (or volblocksize=8k),
> > the metadata is 16GB per 1TB of disk space. Though I imagine
> > compression or fancy encoding of the BP could improve that somewhat.
>
> For simplicity's sake, I've implemented the sorting such that we only
> do it for level=0 blocks and only for blocks with copies=1. Removing
> either of these restrictions seemed like a rather complex addition to
> the algorithm for relatively little payoff. That is not to say that it
> wouldn't work, but basically I was trying to keep this to a relatively
> simple modification.
>
> > I don't know whether to continue this project at this point. I'd like
> > to avoid having two competing implementations of essentially the same
> > thing.
> >
> > Let me see if I can find our existing code and share it with you all.
>
> I'll try to throw our code up somewhere as well.
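To make the extent-joining idea above a bit more concrete, here is a rough,
self-contained C sketch. It is not the actual implementation (which, per the
description above, keeps per-device AVL trees and a range tree); the names
(scan_extent_t, sort_queue_t, sq_add_bp(), SQ_MAX_GAP, etc.) are made up for
illustration, and a flat array stands in for the real trees. It only shows
the mechanics being described: BPs whose offsets fall within an allowed gap
are merged into one extent, each extent tracks how much of its span is real
data ("fill"), and the reader drains the queue best-filled-first.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define SQ_MAX_EXTENTS  1024
    #define SQ_MAX_GAP      (64 * 1024)  /* hypothetical allowed inter-BP gap */

    /* One extent of nearby block pointers on a single device. */
    typedef struct scan_extent {
            uint64_t se_start;  /* offset of the first byte covered */
            uint64_t se_end;    /* offset just past the last byte covered */
            uint64_t se_fill;   /* bytes of actual BP data inside the span */
    } scan_extent_t;

    /* Stand-in for the per-device sorting queue (really AVL + range trees). */
    typedef struct sort_queue {
            scan_extent_t sq_ext[SQ_MAX_EXTENTS];
            int sq_count;
    } sort_queue_t;

    /* Rank extents by fill so the "juiciest" ones come out first. */
    static int
    se_compare(const void *a, const void *b)
    {
            const scan_extent_t *x = a, *y = b;

            if (x->se_fill == y->se_fill)
                    return (0);
            return (x->se_fill > y->se_fill ? -1 : 1);  /* descending */
    }

    /* Scanner side: record one BP, merging it into a nearby extent if any. */
    static void
    sq_add_bp(sort_queue_t *sq, uint64_t offset, uint64_t size)
    {
            for (int i = 0; i < sq->sq_count; i++) {
                    scan_extent_t *se = &sq->sq_ext[i];

                    /* Close enough (within SQ_MAX_GAP)? Then extend this extent. */
                    if (offset <= se->se_end + SQ_MAX_GAP &&
                        offset + size + SQ_MAX_GAP >= se->se_start) {
                            if (offset < se->se_start)
                                    se->se_start = offset;
                            if (offset + size > se->se_end)
                                    se->se_end = offset + size;
                            se->se_fill += size;
                            return;
                    }
            }
            if (sq->sq_count < SQ_MAX_EXTENTS) {  /* toy stand-in for the memory cap */
                    scan_extent_t *se = &sq->sq_ext[sq->sq_count++];

                    se->se_start = offset;
                    se->se_end = offset + size;
                    se->se_fill = size;
            }
    }

    /* Reader side: drain the queue, issuing the best-filled extents first. */
    static void
    sq_issue_reads(sort_queue_t *sq)
    {
            qsort(sq->sq_ext, sq->sq_count, sizeof (scan_extent_t), se_compare);
            for (int i = 0; i < sq->sq_count; i++) {
                    scan_extent_t *se = &sq->sq_ext[i];

                    printf("read [%llu, %llu) fill %llu bytes\n",
                        (unsigned long long)se->se_start,
                        (unsigned long long)se->se_end,
                        (unsigned long long)se->se_fill);
            }
            sq->sq_count = 0;
    }

    int
    main(void)
    {
            sort_queue_t sq = { .sq_count = 0 };

            /* BPs arrive in (random) metadata order ... */
            sq_add_bp(&sq, 512ULL << 20, 8192);            /* far away: own extent */
            sq_add_bp(&sq, 1ULL << 20, 8192);
            sq_add_bp(&sq, (1ULL << 20) + 8192, 8192);     /* merges with previous */

            /* ... and are read back as large, mostly sequential extents. */
            sq_issue_reads(&sq);
            return (0);
    }

In the scheme described above, the scanner would keep feeding something like
sq_add_bp() until the configurable memory cap is hit, then pause while the
reader runs the equivalent of sq_issue_reads(); as noted, the two alternate
rather than run concurrently, so the sequential extent reads are not
interrupted by random metadata I/O.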
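For reference, the 16GB-per-1TB figure quoted above falls out of the size of
an on-disk block pointer (a blkptr_t is 128 bytes):

    1 TiB / 8 KiB recordsize = 2^27 (~134 million) level-0 blocks
    2^27 blocks * 128 bytes per blkptr = 2^34 bytes = 16 GiB

so the metadata footprint, and with it the memory needed to hold every
sortable BP at once, scales inversely with block size; at a 128k recordsize
the same math gives roughly 1GB per 1TB.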
>
> Cheers,
> --
> Saso