On Wed, Oct 16, 2013 at 11:55:10AM -0700, [email protected] wrote:
> In a nutshell Richard -- you're looking at this from a computer science
> theory perspective, without understanding the pragmatic issues. The
> pragmatic engineering solution is to keep the chains small, and optimize
> a solution for small chains. That's what Sašo has done. This is the
> difference between an engineer and a scientist. In this particular case,
> my KISS values definitely lead me more towards the engineering approach.
>
> It's true that in the very worst case AVL trees will win, and have a
> much less bad worst case. But it's also true that if Sašo's work at
> keeping the chain sizes small is effective, then the AVL trees introduce
> substantial complexity and hurt performance in the typical case. The
> worst case should be so exceedingly rare that we can simply discount it.
> If it isn't, then we need to go back and rethink our hash algorithm.
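(For concreteness: boot-time sizing from physical memory is roughly what
the ARC's buf_init() does today. Below is a minimal sketch of that idea;
the 64K-per-bucket divisor, KM_SLEEP, and the exact names are illustrative
assumptions, not the verbatim arc.c code.)

    /*
     * Size the hash table once at boot from physical memory: one
     * bucket per 64K of RAM (assumed average block size), rounded up
     * to a power of two so that (hash & ht_mask) replaces a modulo.
     * physmem is the illumos physical page count, PAGESIZE the page
     * size in bytes.
     */
    uint64_t hsize = 1ULL << 12;            /* floor: 4096 buckets */

    while (hsize * 65536ULL < (uint64_t)physmem * PAGESIZE)
            hsize <<= 1;

    buf_hash_table.ht_mask = hsize - 1;
    buf_hash_table.ht_table =
        kmem_zalloc(hsize * sizeof (arc_buf_hdr_t *), KM_SLEEP);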
I agree. We *need* to keep the chains short. Period. Thus, asymptotic
analysis doesn't really apply here.

> Almost nobody will ever need to tune these hash sizes. I'd argue that we
> could auto tune on boot based on RAM size -- that would be a good
> solution. (Not very many people have systems that allow dynamic RAM
> addition, anyway. I'm not aware of any *x86* solutions that do… and we
> can't properly support the SPARC variants that have such support
> anyway.)

I believe the current solution auto tunes the hash table based on the
physical RAM size.

--
Cheers, Prakash

> - Garrett
>
> On Oct 16, 2013, at 11:07 AM, Richard Yao <[email protected]> wrote:
>
> > On 10/16/2013 01:27 PM, Prakash Surya wrote:
> >> On Wed, Oct 16, 2013 at 02:57:08AM +0100, Saso Kiselkov wrote:
> >>> On 10/16/13 2:42 AM, Prakash Surya wrote:
> >>>> OK, that is where I assumed the speedup was coming from (shorter
> >>>> chains leading to faster lookups).
> >>>>
> >>>> I also assumed there would be a "bucket lock" that needs to be
> >>>> acquired during this lookup similar to the dbuf hash (which would
> >>>> affect concurrency), but I guess that isn't the case here (I
> >>>> haven't studied the arc hash code as well as I have the dbuf hash
> >>>> code).
> >>>
> >>> It's mostly the same. Dbuf would also benefit a bit from
> >>> restructuring, perhaps.
> >>>
> >>>> So, if this simply comes down to a hash collision issue, can't we
> >>>> try and take this a bit further? Can we make the hash size
> >>>> completely dynamic? Instead of using another heuristic, can we
> >>>> grow (and shrink?) the hash as the workload demands? So if the max
> >>>> chain depth reaches a threshold, we increase the number of hash
> >>>> buckets (e.g. double it).
> >>>>
> >>>> Of course the details of performing the grow (i.e. rehash)
> >>>> operation would need to be worked out so it doesn't affect
> >>>> performance, consistency, etc. But from a theoretical standpoint,
> >>>> moving it to be sized dynamically seems like a much better
> >>>> solution, IMO.
> >>>
> >>> Several problems with this approach:
> >>>
> >>> 1) unpredictability - when do we trigger it and by what criteria?
> >>
> >> This would need to be ironed out, but I was thinking that if the max
> >> chain length grows above X, then grow the hash. What the value of X
> >> is could be a tunable with some default value that is determined
> >> through testing certain workloads.
> >>
> >>> 2) the extreme expense of rehashing everything (easily dozens of
> >>>    seconds of one CPU pegged at 100% while everything else grinds
> >>>    to a halt as the ARC is inaccessible)
> >>
> >> Without doing anything clever, yes. You would have to lock the
> >> entire hash table, allocate the new table, and then move everything
> >> over.
> >>
> >> I was hoping there might be a clever way to do this in the
> >> background, so as not to affect performance "too much".
> >>
> >> Thinking out loud here, but maybe something incremental. For
> >> example, a background thread rehashes things one bucket at a time,
> >> and uses an index to indicate if a bucket is valid or not. So:
> >>
> >>     read -> hash to index i of old/current table
> >>     if (bg_thread_is_rehashing && i < rehash_index)
> >>             rehash buffer to get index j of new table
> >>             look for buffer in new table using index j
> >>     else
> >>             look for buffer in old/current table using index i
> >>
> >> And the background thread will need to lock each bucket, rehash the
> >> buffers into the new table, increment the "rehash_index", unlock the
> >> bucket, and then repeat until all buckets are rehashed.
> >>
> >> Granted, I haven't given this much thought, but if it's possible to
> >> do in an unobtrusive way, I think it's worth pursuing.
> >>
> >>> 3) hard to diagnose latency spikes due to problem #2
> >>
> >> It would have to be done so as not to introduce these latency
> >> spikes.
> >>
> >>> The easiest approach is just to give the admin a knob which they
> >>> can twiddle at boot and then leave alone. If performance is really
> >>> a problem, they can schedule downtime. Doing it at runtime has its
> >>> problems and doing it automatically is dangerous.
> >>
> >> I agree that the easiest approach is to just statically size the
> >> hash table, but that doesn't really solve the problem. Rather, we're
> >> just kicking the can down the road, IMO.
> >>
> >> If the completely dynamic approach isn't tractable, why split the
> >> table into a 2D array? Why not just increase the size of it, and
> >> keep it a 1D array?
> >
> > I was using my iPad to compose replies and my mail client took the
> > conversation between me and Saso off list without me noticing until
> > now. Apparently, Saso did not notice either (I asked him), so we are
> > bringing it back to the list now.
> >
> > Anyway, the hash table size is neither static nor dynamically sized,
> > both at this time and with Saso's modifications. It is variable based
> > on the amount of system memory. Therefore, I do not think it is
> > correct to say that we are kicking the problem down the road.
> > However, I do think that it is correct to say that we are overly
> > reliant on proper hash table tuning and we can do better. The three
> > of us appear to have different positions on this.
> >
> > 1. Saso seems to think that the system administrator should tune the
> > system, which neither you nor I like.
> >
> > 2. You think that we should automatically tune the size of the hash
> > table (i.e. extendible hashing) depending on the workload. Saso does
> > not think that can be done without runtime lags. I think that we
> > already use extendible hashing in the ZAP and the algorithm used
> > there could be evaluated for use here, but I have not paid much
> > attention to it.
> >
> > 3. I think using AVL trees would handle hash collisions more
> > gracefully and reduce the benefit of any kind of tuning that could be
> > done either manually or by a runtime algorithm.
> >
> > Populating a bucket containing an AVL tree with N nodes requires
> > O(N*log(N)) time while populating a bucket containing a linked list
> > with N nodes requires O(N^2) time. Doing a lookup using an AVL tree
> > in a bucket with N nodes requires O(log(N)) time while doing a lookup
> > using a linked list in a bucket with N nodes requires O(N) time.
> >
> > The AVL tree approach is clearly superior from an asymptotic
> > standpoint.
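(To spell out where those population bounds come from: insertion number i
must first search the structure that already holds i-1 entries, e.g. to
detect an existing header. Summed over N inserts, in the same notation
used above:)

    AVL tree bucket:    sum_{i=1..N} O(log i) = O(log N!) = O(N*log(N))
    linked-list bucket: sum_{i=1..N} O(i)     = O(N*(N+1)/2) = O(N^2)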
> >
> > Saso had performance numbers from some benchmarks on a system with
> > 8GB of RAM, which I will quote:
> >
> >     Rebuild time:
> >     Old hash table (128k/1GB)   ~50s
> >     256x AVL trees              ~20s
> >     New hash table (1MB/1GB)    ~10s
> >
> > The old hash table using linked lists had 131072 hash buckets. The
> > new hash table using linked lists has 1048576 hash buckets. The hash
> > table using AVL trees had 256 buckets, which is a clear handicap, but
> > it still beats the old hash table by a clear margin and came within a
> > factor of 2 of the new hash table.
> >
> > Here is the calculation that Saso did earlier to show the depth (in
> > comparisons) of a lookup in the AVL tree when we have 256 of them:
> >
> >     $ echo 'l(6.4 * 10^6 / 256) / l(2)' | bc -l
> >     14.60964047443681173949
> >
> > Let's compare that to the depth of a lookup in an AVL tree when we
> > have 65536 of them:
> >
> >     $ echo 'l(6.4 * 10^6 / 65536) / l(2)' | bc -l
> >     6.60964047443681173940
> >
> > That is a factor of 2 better. The rebuild times require the same
> > number of lookups per insertion, so we should do better overall by
> > just using 65536 AVL trees (8192 per 1GB of memory). The rebuild time
> > in this instance should be slightly lower, and performance when the
> > hash table size is not tuned properly will degrade logarithmically
> > rather than linearly.
> >
> > It is also possible for the number of entries to be significantly
> > greater than Saso assumed in his calculations. An example of this
> > would be systems that do virtualization on top of zvols with
> > volblocksize=4K (or worse, volblocksize=512). Using AVL trees would
> > ensure that such workloads are handled gracefully while the current
> > linked list approach would not.
> >
> > It is true that the system administrator could always make things
> > better with the tunable that Saso proposed, but I do not think it is
> > realistic to expect more than a small percentage of systems that
> > would benefit from such tuning to receive it from the system
> > administrator.
> >
> > The AVL tree approach that I suggest is not mutually exclusive with
> > your suggestion of extendible hashing or Saso's idea to let the
> > system administrator do manual tuning, but I think it will
> > significantly reduce the need for either of them.
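For the record, here is roughly what the per-bucket AVL variant Richard
describes could look like on top of the illumos sys/avl.h interfaces.
This is only a sketch under assumptions: arc_buf_hdr_t is pared down to
the hash key plus tree linkage, buf_hash() stands in for the real hash
function, and all locking is omitted.

    #include <string.h>
    #include <sys/avl.h>

    typedef struct dva { uint64_t dva_word[2]; } dva_t;  /* simplified */

    typedef struct arc_buf_hdr {
            uint64_t        b_spa;
            dva_t           b_dva;
            uint64_t        b_birth;
            avl_node_t      b_hash_node;    /* linkage in bucket tree */
    } arc_buf_hdr_t;

    typedef struct buf_hash_table {
            uint64_t        ht_mask;        /* nbuckets - 1, power of 2 */
            avl_tree_t      *ht_bucket;     /* one AVL tree per bucket */
    } buf_hash_table_t;

    extern uint64_t buf_hash(uint64_t spa, const dva_t *dva,
        uint64_t birth);                    /* assumed hash function */

    /* Total order on (dva, birth, spa); avl.h wants -1/0/+1. */
    static int
    buf_hash_compare(const void *a, const void *b)
    {
            const arc_buf_hdr_t *ha = a, *hb = b;
            int c = memcmp(&ha->b_dva, &hb->b_dva, sizeof (dva_t));

            if (c != 0)
                    return (c > 0 ? 1 : -1);
            if (ha->b_birth != hb->b_birth)
                    return (ha->b_birth > hb->b_birth ? 1 : -1);
            if (ha->b_spa != hb->b_spa)
                    return (ha->b_spa > hb->b_spa ? 1 : -1);
            return (0);
    }

    /* O(log N) descent of the bucket's tree, not an O(N) chain walk. */
    static arc_buf_hdr_t *
    buf_hash_find(buf_hash_table_t *ht, uint64_t spa, const dva_t *dva,
        uint64_t birth)
    {
            arc_buf_hdr_t search;
            uint64_t idx = buf_hash(spa, dva, birth) & ht->ht_mask;

            search.b_spa = spa;
            search.b_dva = *dva;
            search.b_birth = birth;
            return (avl_find(&ht->ht_bucket[idx], &search, NULL));
    }

Each bucket would be set up once at boot with avl_create(&ht->ht_bucket[i],
buf_hash_compare, sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t,
b_hash_node)), and an insert becomes avl_add() after a failed avl_find().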
