On Mon, Jul 11, 2016 at 9:30 AM, Saso Kiselkov <skiselkov...@gmail.com> wrote:
> On 7/9/16 11:24 PM, Matthew Ahrens wrote:
> > We had an intern work on "sorted scrub" last year. Essentially the idea
> > was to read the metadata to gather into memory all the BP's that need to
> > be scrubbed, sort them by DVA (i.e. offset on disk) and then issue the
> > scrub i/os in that sorted order. However, memory can't hold all of the
> > BP's, so we do multiple passes over the metadata, each pass gathering
> > the next chunk of BP's. This code is implemented and seems to work but
> > probably needs some more testing and code cleanup.
> >
> > One of the downsides of that approach is having to do multiple passes
> > over the metadata if it doesn't all fit in memory (which it typically
> > does not). In some circumstances, this is worth it, but in others not
> > so much. To improve on that, we would like to do just one pass over the
> > metadata to find all the block pointers. Rather than storing the BP's
> > sorted in memory, we would store them on disk, but only roughly sorted.
> > There are several ways we could do the sorting, which is one of the
> > issues that makes this problem interesting.
> >
> > We could divide each top-level vdev into chunks (like metaslabs, but
> > probably a different number of them) and for each chunk have an on-disk
> > list of BP's in that chunk that need to be scrubbed/resilvered. When we
> > find a BP, we would append it to the appropriate list. Once we have
> > traversed all the metadata to find all the BP's, we would load one
> > chunk's list of BP's into memory, sort it, and then issue the resilver
> > i/os in sorted order.
> >
> > As an alternative, it might be better to accumulate as many BP's as fit
> > in memory, sort them, and then write that sorted list to disk. Then
> > remove those BP's from memory and start filling memory again, write that
> > list, etc. Then read all the sorted lists in parallel to do a merge
> > sort. This has the advantage that we do not need to append to lots of
> > lists as we are traversing the metadata. Instead we have to read from
> > lots of lists as we do the scrubs, but this should be more efficient. We
> > also don't have to determine beforehand how many chunks to divide each
> > vdev into.
> >
> > If you'd like to continue working on sorted scrub along these lines, let
> > me know.
>
> It seems multiple people converged on the same problem at the same time.
> I've implemented this at Nexenta, but it's not quite release-ready yet.

Very cool. We should definitely collaborate.

> We avoid having to do multiple passes over the data by grouping pointers
> by proximity and sorting in a more complicated way, rather than just
> doing a naive elevator each pass. This results in performance that,
> depending on how badly randomized the layout is, yields upwards of 10x
> performance improvement (and in many cases approaches sequential
> throughput). We've not implemented the storing of the partial-sorted
> list on disk, as that would be a far more complex undertaking than just
> keeping it in memory. That having been said, our memory overhead is
> essentially zero compared to on-disk metadata, i.e. we are able to sort
> and manipulate all sortable BPs in memory in essentially the same amount
> of space that the metadata takes up on disk. IOW, as long as your RAM
> quantity is not too far off being 1% of your disk storage's capacity, we
> can sort extremely well and achieve near sequential throughput.

So you need to keep all the BP's in the pool in memory at once? What
happens if they don't fit? Falls back on the current non-sorted scrub?
We didn't think that was OK, which is why we allowed multiple passes over
the metadata. If you're using recordsize=8k (or volblocksize=8k), the
metadata is 16GB per 1TB of disk space. Though I imagine compression or
fancy encoding of the BP could improve that somewhat.
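For reference, the arithmetic behind that 16GB figure, assuming a full
128-byte blkptr_t has to be tracked for every data block:

    1 TiB / 8 KiB per block = 134,217,728 blocks
    134,217,728 blocks * 128 bytes per blkptr_t = 16 GiB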
> I don't know whether to continue this project at this point. I'd like to
> avoid having two competing implementations of essentially the same thing.

Let me see if I can find our existing code and share it with you all.

--matt

> --
> Saso
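For anyone who wants to experiment with the spill-and-merge idea discussed
above, below is a minimal, self-contained sketch in C. It uses a simplified
stand-in for blkptr_t, a toy batch size, and plain stdio temp files; it is
not the ZFS implementation, just the shape of the algorithm: sort each
in-memory batch by (vdev, offset), spill it as a sorted run, then merge the
runs to get the order in which the i/os would be issued.

/*
 * Minimal sketch of the "accumulate, sort, spill, then merge" idea from
 * the mail above: buffer block pointers found during the metadata
 * traversal, sort each batch by (vdev, offset), spill it to a temp file,
 * then k-way merge the sorted runs so the I/Os come out in (roughly)
 * on-disk order.  Record layout, batch size, and the "issue" step (a
 * printf) are placeholders, not the ZFS data structures.
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

typedef struct bp_rec {
	uint64_t vdev;		/* top-level vdev id */
	uint64_t offset;	/* DVA offset within that vdev */
} bp_rec_t;

static int
bp_cmp(const void *a, const void *b)
{
	const bp_rec_t *x = a, *y = b;

	if (x->vdev != y->vdev)
		return (x->vdev < y->vdev ? -1 : 1);
	if (x->offset != y->offset)
		return (x->offset < y->offset ? -1 : 1);
	return (0);
}

/* Sort one in-memory batch and write it out as a sorted run. */
static FILE *
spill_run(bp_rec_t *buf, size_t n)
{
	FILE *fp = tmpfile();

	qsort(buf, n, sizeof (bp_rec_t), bp_cmp);
	if (fp == NULL || fwrite(buf, sizeof (bp_rec_t), n, fp) != n) {
		perror("spill_run");
		exit(1);
	}
	rewind(fp);
	return (fp);
}

/*
 * Merge the sorted runs: repeatedly take the run whose next record sorts
 * lowest.  A real version would use a heap and issue scrub zios here.
 */
static void
merge_runs(FILE **runs, int nruns)
{
	bp_rec_t *cur = calloc(nruns, sizeof (bp_rec_t));
	int *valid = calloc(nruns, sizeof (int));
	int i, best;

	for (i = 0; i < nruns; i++)
		valid[i] = (fread(&cur[i], sizeof (bp_rec_t), 1, runs[i]) == 1);
	for (;;) {
		best = -1;
		for (i = 0; i < nruns; i++)
			if (valid[i] && (best == -1 ||
			    bp_cmp(&cur[i], &cur[best]) < 0))
				best = i;
		if (best == -1)
			break;
		printf("scrub vdev %llu offset %llu\n",
		    (unsigned long long)cur[best].vdev,
		    (unsigned long long)cur[best].offset);
		valid[best] = (fread(&cur[best], sizeof (bp_rec_t), 1,
		    runs[best]) == 1);
	}
	free(cur);
	free(valid);
}

#define	BATCH	4	/* tiny "memory limit" so the example actually spills */

int
main(void)
{
	bp_rec_t buf[BATCH];
	FILE *runs[16];
	int nruns = 0, i;
	size_t n = 0;

	srand(12345);			/* stand-in for the traversal */
	for (i = 0; i < 20; i++) {
		buf[n].vdev = rand() % 2;
		buf[n].offset = (uint64_t)rand() << 9;
		if (++n == BATCH) {
			runs[nruns++] = spill_run(buf, n);
			n = 0;
		}
	}
	if (n > 0)
		runs[nruns++] = spill_run(buf, n);
	merge_runs(runs, nruns);
	for (i = 0; i < nruns; i++)
		(void) fclose(runs[i]);
	return (0);
}

Compiling and running this prints the simulated scrub i/os in per-vdev
offset order, regardless of the order in which the "traversal" found them.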