In a nutshell, Richard -- you're looking at this from a computer science theory 
perspective, without understanding the pragmatic issues.  The pragmatic 
engineering solution is to keep the chains small, and optimize a solution for 
small chains.  That's what Sašo has done.   This is the difference between an 
engineer and a scientist.  In this particular case, my KISS values definitely 
lead me more towards the engineering approach. 

It's true that in the very worst case AVL trees will win, and have a much less 
bad worst case.  But it's also true that if Sašo's work at keeping the chain 
sizes small is effective, then the AVL trees introduce substantial complexity 
and hurt performance in the typical case.  The worst case should be 
sufficiently rare that we can simply discount it.  If it isn't, then we need to 
go back and rethink our hash algorithm.

Almost nobody will ever need to tune these hash sizes.  I'd argue that we could 
auto-tune at boot based on RAM size -- that would be a good solution.  (Not 
very many people have systems that allow dynamic RAM addition, anyway.  I'm not 
aware of any *x86* solutions that do… and we can't properly support the SPARC 
variants that have such support anyway.)
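
Something like this at module load would be enough (a rough sketch, not a
patch -- the names just mirror the style of the existing buf_init() sizing
code, and the one-bucket-per-8KB ratio is a placeholder):

    /*
     * Sketch: size the hash table once at module load, aiming for
     * roughly one bucket per 8KB of physical memory, rounded up to a
     * power of two so lookups can use a mask instead of a modulo.
     */
    uint64_t hsize = 1ULL << 12;        /* floor of 4096 buckets */

    while (hsize * 8192 < (uint64_t)physmem * PAGESIZE)
        hsize <<= 1;

    buf_hash_table.ht_mask = hsize - 1;
    buf_hash_table.ht_table =
        kmem_zalloc(hsize * sizeof (void *), KM_SLEEP);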

        - Garrett

On Oct 16, 2013, at 11:07 AM, Richard Yao <[email protected]> wrote:

> On 10/16/2013 01:27 PM, Prakash Surya wrote:
>> On Wed, Oct 16, 2013 at 02:57:08AM +0100, Saso Kiselkov wrote:
>>> On 10/16/13 2:42 AM, Prakash Surya wrote:
>>>> OK, that is where I assumed the speed up was coming from (shorter
>>>> chains leading to faster lookups).
>>>> 
>>>> I also assumed there would be a "bucket lock" that needs to be acquired
>>>> during this lookup similar to the dbuf hash (which would affect
>>>> concurrency), but I guess that isn't the case here (I haven't studied
>>>> the arc hash code as well as I have the dbuf hash code).
>>> 
>>> It's mostly the same. Dbuf would also benefit a bit from restructuring,
>>> perhaps.
>>> 
>>>> So, if this simply comes down to a hash collision issue, can't we take
>>>> this a bit further?  Can we make the hash size completely dynamic?
>>>> Instead of using another heuristic, can we grow (and shrink?)
>>>> the hash as the workload demands? So if the max chain depth reaches a
>>>> threshold, we increase the number of hash buckets (e.g. double it).
>>>> 
>>>> Of course the details of performing the grow (i.e. rehash) operation
>>>> would need to be worked out so it doesn't affect performance,
>>>> consistency, etc.  But from a theoretical standpoint, moving it to be
>>>> sized dynamically seems like a much better solution, IMO.
>>> 
>>> Several problems with this approach:
>>> 
>>> 1) unpredictability - when do we trigger it, and by what criteria?
>> 
>> This would need to be ironed out, but I was thinking that if the max
>> chain length grows above X, then grow the hash. What the value of X is
>> could be a tunable with some default value that is determined through
>> testing certain workloads.
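>> 
>> For instance (purely illustrative names, nothing that exists today), the
>> check on the insert path could be as simple as:
>> 
>>     /*
>>      * Hypothetical trigger: if an insert leaves a chain longer than
>>      * the tunable threshold and no resize is already in flight, ask
>>      * a background thread to grow the table.
>>      */
>>     if (chain_len > arc_ht_max_chain && !ht_rehashing)
>>         ht_request_grow();    /* e.g. cv_signal() the resize thread */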
>> 
>>> 2) the extreme expense of rehashing everything (easily dozens of
>>>   seconds of one CPU pegged at 100% while everything else grinds
>>>   to a halt as the ARC is inaccessible)
>> 
>> Without doing anything clever, yes. You would have to lock the entire
>> hash table, allocate the new table, and then move everything over.
>> 
>> I was hoping there might be a clever way to do this in the background,
>> so as not to affect performance "too much".
>> 
>> Thinking out loud here, but maybe something incremental. For example, a
>> background thread rehashes things one bucket at a time, and uses an index
>> to indicate whether a bucket has been rehashed yet.  So:
>> 
>>    read -> hash to index i of old/current table
>>    if (bg_thread_is_rehashing && i < rehash_index)
>>        rehash buffer to get index j of new table
>>        look for buffer in new table using index j
>>    else
>>        look for buffer in old/current table using index i
>> 
>> And the background thread will need to lock each bucket, rehash the
>> buffers into the new table, increment the "rehash_index", unlock the
>> bucket, and then repeat until all buckets are rehashed.
>> 
>> Granted, I haven't given this much thought, but if it's possible to do
>> in an unobtrusive way, I think it's worth pursuing.
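>> 
>> In C-like terms (all of these names are made up purely for illustration),
>> the lookup path might look roughly like:
>> 
>>     /*
>>      * Sketch: ht_old/ht_new are the current and resized tables,
>>      * ht_rehash_index is the next bucket the background thread will
>>      * migrate, and ht_rehashing flags a resize in progress.  The real
>>      * thing would take the bucket lock before re-checking
>>      * ht_rehash_index so it can't race with the migration thread.
>>      */
>>     static arc_buf_hdr_t *
>>     ht_lookup(uint64_t hv, const void *key)
>>     {
>>         uint64_t i = hv & ht_old.ht_mask;
>> 
>>         if (ht_rehashing && i < ht_rehash_index) {
>>             /* bucket i was already migrated; search the new table */
>>             uint64_t j = hv & ht_new.ht_mask;
>>             return (bucket_search(&ht_new.ht_buckets[j], key));
>>         }
>>         /* not migrated yet (or no resize running); search the old table */
>>         return (bucket_search(&ht_old.ht_buckets[i], key));
>>     }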
>> 
>>> 3) hard to diagnose latency spikes due to problem #2
>> 
>> It would have to be done so as not to introduce these latency spikes.
>> 
>>> 
>>> The easiest approach is just to give the admin a knob which they can
>>> twiddle at boot and then leave alone. If performance is really a
>>> problem, they can schedule downtime. Doing it at runtime has its
>>> problems and doing it automatically is dangerous.
>> 
>> I agree that the easiest approach is to just statically size the hash table,
>> but that doesn't really solve the problem. Rather we're just kicking the
>> can down the road, IMO.
>> 
>> If the completely dynamic approach isn't tractable, why split the table
>> into a 2D array? Why not just increase the size of it, and keep it a 1D
>> array?
>> 
> 
> I was using my iPad to compose replies and my mail client took the
> conversation between me and Saso off list without me noticing until now.
> Apparently, Saso did not notice either (I asked him), so we are bringing
> it back to the list now.
> 
> Anyway, both at this time and with Saso's modifications, the hash table
> size is neither static nor dynamically sized: it is set based on the
> amount of system memory. Therefore, I do not think it is correct to say
> that we are kicking the problem down the road. However, I do think that
> it is correct to say that we are overly reliant on proper hash table
> tuning and we can do better. The three of us appear to have different
> positions on this.
> 
> 1. Saso seems to think that the system administrator should tune the
> system, which neither you nor I like.
> 
> 2. You think that we should automatically tune the size of the hash
> table (i.e. extendible hashing) depending on the workload. Saso does
> not think that can be done without runtime lags. I think that we already
> use extendible hashing in the ZAP and the algorithm used there could be
> evaluated for use here, but I have not paid much attention to it.
> 
> 3. I think using AVL trees would handle hash collisions more gracefully
> and reduce the benefit of any kind of tuning that could be done either
> manually or by a runtime algorithm.
> 
> Populating a bucket containing an AVL tree with N nodes requires
> O(N*log(N)) time while populating a bucket containing a linked list with
> N nodes requires O(N^2) time. Doing a lookup using an AVL tree in a
> bucket with N nodes requires O(log(N)) time while doing a lookup using a
> linked list in a bucket with N nodes requires O(N) time.
> 
> The AVL tree approach is clearly superior from an asymptotic standpoint.
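> 
> To make that concrete (a sketch only, using the stock <sys/avl.h>
> interfaces; none of this is in arc.c today), each bucket would hold an
> avl_tree_t keyed on (spa, dva, birth), and the lookup becomes an avl_find()
> instead of a chain walk:
> 
>     #include <sys/avl.h>        /* avl_create(), avl_find(), avl_add() */
> 
>     typedef struct buf_hash_bucket {
>         kmutex_t    hb_lock;
>         avl_tree_t  hb_tree;    /* arc_buf_hdr_t nodes, ordered by key */
>     } buf_hash_bucket_t;
> 
>     static arc_buf_hdr_t *
>     buf_hash_find_avl(buf_hash_bucket_t *hb, const arc_buf_hdr_t *search)
>     {
>         /* 'search' is a stack header with only the key fields filled in */
>         arc_buf_hdr_t *hdr;
> 
>         mutex_enter(&hb->hb_lock);
>         hdr = avl_find(&hb->hb_tree, search, NULL);   /* O(log N) descent */
>         mutex_exit(&hb->hb_lock);
>         return (hdr);
>     }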
> 
> Saso had performance numbers from some benchmarks on a system with 8GB
> of RAM, which I will quote:
> 
>                               Rebuild time
> Old hash table (128k/1GB)     ~50s
> 256x AVL trees                ~20s
> New hash table (1MB/1GB)      ~10s
> 
> The old hash table using linked lists had 131072 hash buckets. The new
> hash table using linked lists has 1048576 hash buckets. The hash table
> using AVL trees had only 256 buckets, which is a clear handicap, yet it
> still beat the old hash table by a clear margin and came within a factor
> of 2 of the new hash table.
> 
> Here is the calculation that Saso did earlier to show roughly how many
> comparisons a lookup in an AVL tree takes when we have 256 of them:
> 
> $ echo 'l(6.4 * 10^6 / 256) / l(2)' | bc -l
> 14.60964047443681173949
> 
> Let's compare that to a lookup in an AVL tree when we have 65536 of them:
> 
> $ echo 'l(6.4 * 10^6 / 65536) / l(2)' | bc -l
> 6.60964047443681173940
> 
> That is a factor of 2 better. Since a rebuild does the same number of
> lookups per insertion, we should do better overall by just using 65536
> AVL trees (8192 per 1GB of memory). The rebuild time in this instance
> should be slightly lower, and performance when the hash table size is not
> tuned properly will degrade logarithmically rather than linearly.
> 
> It is also possible for the number of entries to be significantly
> greater than Saso assumed in his calculations. An example of this would
> be systems that do virtualization on top of zvols with volblocksize=4K
> (or worse, volblocksize=512). Using AVL trees would ensure that such
> workloads are handled gracefully while the current linked list approach
> would not.
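> 
> To put rough numbers on that (assuming data blocks dominate the cache), a
> gigabyte of cached data is about 8,192 blocks at the default 128K
> recordsize, about 262,144 blocks at volblocksize=4K, and about 2,097,152
> blocks at volblocksize=512 -- a 32x to 256x increase in hash entries per
> gigabyte.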
> 
> It is true that the system administrator could always make things better
> with the tunable that Saso proposed, but I do not think it is realistic to
> expect more than a small percentage of the systems that would benefit from
> such tuning to actually receive it.
> 
> The AVL tree approach that I suggest is not mutually exclusive with your
> suggestion of extendible hashing or Saso's idea to let the system
> administrator do manual tuning, but I think it will significantly reduce
> the need for either of them.
> 

_______________________________________________
developer mailing list
[email protected]
http://lists.open-zfs.org/mailman/listinfo/developer
