On Wed, Oct 16, 2013 at 11:55:10AM -0700, [email protected] wrote:
> In a nutshell Richard -- you're looking at this from a computer science
> theory perspective, without understanding the pragmatic issues. The
> pragmatic engineering solution is to keep the chains small, and optimize
> a solution for small chains. That's what Sašo has done. This is the
> difference between an engineer and a scientist. In this particular case,
> my KISS values definitely lead me more towards the engineering approach.
>
> It's true that in the very worst case AVL trees will win, and have a
> much less bad worst case. But it's also true that if Sašo's work at
> keeping the chain sizes small is effective, then the AVL trees introduce
> substantial complexity and hurt performance in the typical case. The
> worst case should be so exceedingly rare that we can simply discount it.
> If it isn't, then we need to go back and rethink our hash algorithm.
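(For concreteness: boot-time sizing from physical memory is roughly what
the ARC's buf_init() does today. Below is a minimal sketch of that idea;
the 64K-per-bucket divisor, KM_SLEEP, and the exact names are illustrative
assumptions, not the verbatim arc.c code.)

    /*
     * Size the hash table once at boot from physical memory: one
     * bucket per 64K of RAM (assumed average block size), rounded up
     * to a power of two so that (hash & ht_mask) replaces a modulo.
     * physmem is the illumos physical page count, PAGESIZE the page
     * size in bytes.
     */
    uint64_t hsize = 1ULL << 12;            /* floor: 4096 buckets */

    while (hsize * 65536ULL < (uint64_t)physmem * PAGESIZE)
            hsize <<= 1;

    buf_hash_table.ht_mask = hsize - 1;
    buf_hash_table.ht_table =
        kmem_zalloc(hsize * sizeof (arc_buf_hdr_t *), KM_SLEEP);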
I agree. We *need* to keep the chains short. Period. Thus, asymptotic
analysis doesn't really apply here.

> Almost nobody will ever need to tune these hash sizes. I'd argue that we
> could auto tune on boot based on RAM size -- that would be a good
> solution. (Not very many people have systems that allow dynamic RAM
> addition, anyway. I'm not aware of any *x86* solutions that do… and we
> can't properly support the SPARC variants that have such support
> anyway.)

I believe the current solution auto tunes the hash table based on the
physical RAM size.

--
Cheers, Prakash

> - Garrett
>
> On Oct 16, 2013, at 11:07 AM, Richard Yao <[email protected]> wrote:
>
> > On 10/16/2013 01:27 PM, Prakash Surya wrote:
> >> On Wed, Oct 16, 2013 at 02:57:08AM +0100, Saso Kiselkov wrote:
> >>> On 10/16/13 2:42 AM, Prakash Surya wrote:
> >>>> OK, that is where I assumed the speedup was coming from (shorter
> >>>> chains leading to faster lookups).
> >>>>
> >>>> I also assumed there would be a "bucket lock" that needs to be
> >>>> acquired during this lookup similar to the dbuf hash (which would
> >>>> affect concurrency), but I guess that isn't the case here (I
> >>>> haven't studied the arc hash code as well as I have the dbuf hash
> >>>> code).
> >>>
> >>> It's mostly the same. Dbuf would also benefit a bit from
> >>> restructuring, perhaps.
> >>>
> >>>> So, if this simply comes down to a hash collision issue, can't we
> >>>> try and take this a bit further? Can we make the hash size
> >>>> completely dynamic? Instead of using another heuristic, can we
> >>>> grow (and shrink?) the hash as the workload demands? So if the max
> >>>> chain depth reaches a threshold, we increase the number of hash
> >>>> buckets (e.g. double it).
> >>>>
> >>>> Of course the details of performing the grow (i.e. rehash)
> >>>> operation would need to be worked out so it doesn't affect
> >>>> performance, consistency, etc. But from a theoretical standpoint,
> >>>> moving it to be sized dynamically seems like a much better
> >>>> solution, IMO.
> >>>
> >>> Several problems with this approach:
> >>>
> >>> 1) unpredictability - when do we trigger it and by what criteria?
> >>
> >> This would need to be ironed out, but I was thinking that if the max
> >> chain length grows above X, then grow the hash. What the value of X
> >> is could be a tunable with some default value that is determined
> >> through testing certain workloads.
> >>
> >>> 2) the extreme expense of rehashing everything (easily dozens of
> >>>    seconds of one CPU pegged at 100% while everything else grinds
> >>>    to a halt as the ARC is inaccessible)
> >>
> >> Without doing anything clever, yes. You would have to lock the
> >> entire hash table, allocate the new table, and then move everything
> >> over.
> >>
> >> I was hoping there might be a clever way to do this in the
> >> background, so as not to affect performance "too much".
> >>
> >> Thinking out loud here, but maybe something incremental. For
> >> example, a background thread rehashes things one bucket at a time,
> >> and uses an index to indicate if a bucket is valid or not. So:
> >>
> >>     read -> hash to index i of old/current table
> >>     if (bg_thread_is_rehashing && i < rehash_index)
> >>             rehash buffer to get index j of new table
> >>             look for buffer in new table using index j
> >>     else
> >>             look for buffer in old/current table using index i
> >>
> >> And the background thread will need to lock each bucket, rehash the
> >> buffers into the new table, increment the "rehash_index", unlock the
> >> bucket, and then repeat until all buckets are rehashed.
> >>
> >> Granted, I haven't given this much thought, but if it's possible to
> >> do in an unobtrusive way, I think it's worth pursuing.
> >>
> >>> 3) hard to diagnose latency spikes due to problem #2
> >>
> >> It would have to be done so as not to introduce these latency
> >> spikes.
> >>
> >>> The easiest approach is just to give the admin a knob which they
> >>> can twiddle at boot and then leave alone. If performance is really
> >>> a problem, they can schedule downtime. Doing it at runtime has its
> >>> problems and doing it automatically is dangerous.
> >>
> >> I agree that the easiest approach is to just statically size the
> >> hash table, but that doesn't really solve the problem. Rather, we're
> >> just kicking the can down the road, IMO.
> >>
> >> If the completely dynamic approach isn't tractable, why split the
> >> table into a 2D array? Why not just increase the size of it, and
> >> keep it a 1D array?
> >
> > I was using my iPad to compose replies and my mail client took the
> > conversation between me and Saso off list without me noticing until
> > now. Apparently, Saso did not notice either (I asked him), so we are
> > bringing it back to the list now.
> >
> > Anyway, the hash table size is neither static nor dynamically sized,
> > both at this time and with Saso's modifications. It is variable based
> > on the amount of system memory. Therefore, I do not think it is
> > correct to say that we are kicking the problem down the road.
> > However, I do think that it is correct to say that we are overly
> > reliant on proper hash table tuning and we can do better. The three
> > of us appear to have different positions on this.
> >
> > 1. Saso seems to think that the system administrator should tune the
> > system, which neither you nor I like.
> >
> > 2. You think that we should automatically tune the size of the hash
> > table (i.e. extendible hashing) depending on the workload. Saso does
> > not think that can be done without runtime lags. I think that we
> > already use extendible hashing in the ZAP and the algorithm used
> > there could be evaluated for use here, but I have not paid much
> > attention to it.
> >
> > 3. I think using AVL trees would handle hash collisions more
> > gracefully and reduce the benefit of any kind of tuning that could be
> > done either manually or by a runtime algorithm.
> >
> > Populating a bucket containing an AVL tree with N nodes requires
> > O(N*log(N)) time while populating a bucket containing a linked list
> > with N nodes requires O(N^2) time. Doing a lookup using an AVL tree
> > in a bucket with N nodes requires O(log(N)) time while doing a lookup
> > using a linked list in a bucket with N nodes requires O(N) time.
> >
> > The AVL tree approach is clearly superior from an asymptotic
> > standpoint.
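(To spell out where those population bounds come from: insertion number i
must first search the structure that already holds i-1 entries, e.g. to
detect an existing header. Summed over N inserts, in the same notation
used above:)

    AVL tree bucket:    sum_{i=1..N} O(log i) = O(log N!) = O(N*log(N))
    linked-list bucket: sum_{i=1..N} O(i)     = O(N*(N+1)/2) = O(N^2)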
> >
> > Saso had performance numbers from some benchmarks on a system with
> > 8GB of RAM, which I will quote:
> >
> >     Rebuild time:
> >     Old hash table (128k/1GB)   ~50s
> >     256x AVL trees              ~20s
> >     New hash table (1MB/1GB)    ~10s
> >
> > The old hash table using linked lists had 131072 hash buckets. The
> > new hash table using linked lists has 1048576 hash buckets. The hash
> > table using AVL trees had 256 buckets, which is a clear handicap, but
> > it still beats the old hash table by a clear margin and came within a
> > factor of 2 of the new hash table.
> >
> > Here is the calculation that Saso did earlier to show the depth (in
> > comparisons) of a lookup in the AVL tree when we have 256 of them:
> >
> >     $ echo 'l(6.4 * 10^6 / 256) / l(2)' | bc -l
> >     14.60964047443681173949
> >
> > Let's compare that to the depth of a lookup in an AVL tree when we
> > have 65536 of them:
> >
> >     $ echo 'l(6.4 * 10^6 / 65536) / l(2)' | bc -l
> >     6.60964047443681173940
> >
> > That is a factor of 2 better. The rebuild times require the same
> > number of lookups per insertion, so we should do better overall by
> > just using 65536 AVL trees (8192 per 1GB of memory). The rebuild time
> > in this instance should be slightly lower, and performance when the
> > hash table size is not tuned properly will degrade logarithmically
> > rather than linearly.
> >
> > It is also possible for the number of entries to be significantly
> > greater than Saso assumed in his calculations. An example of this
> > would be systems that do virtualization on top of zvols with
> > volblocksize=4K (or worse, volblocksize=512). Using AVL trees would
> > ensure that such workloads are handled gracefully while the current
> > linked list approach would not.
> >
> > It is true that the system administrator could always make things
> > better with the tunable that Saso proposed, but I do not think it is
> > realistic to expect more than a small percentage of systems that
> > would benefit from such tuning to receive it from the system
> > administrator.
> >
> > The AVL tree approach that I suggest is not mutually exclusive with
> > your suggestion of extendible hashing or Saso's idea to let the
> > system administrator do manual tuning, but I think it will
> > significantly reduce the need for either of them.
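For the record, here is roughly what the per-bucket AVL variant Richard
describes could look like on top of the illumos sys/avl.h interfaces.
This is only a sketch under assumptions: arc_buf_hdr_t is pared down to
the hash key plus tree linkage, buf_hash() stands in for the real hash
function, and all locking is omitted.

    #include <string.h>
    #include <sys/avl.h>

    typedef struct dva { uint64_t dva_word[2]; } dva_t;  /* simplified */

    typedef struct arc_buf_hdr {
            uint64_t        b_spa;
            dva_t           b_dva;
            uint64_t        b_birth;
            avl_node_t      b_hash_node;    /* linkage in bucket tree */
    } arc_buf_hdr_t;

    typedef struct buf_hash_table {
            uint64_t        ht_mask;        /* nbuckets - 1, power of 2 */
            avl_tree_t      *ht_bucket;     /* one AVL tree per bucket */
    } buf_hash_table_t;

    extern uint64_t buf_hash(uint64_t spa, const dva_t *dva,
        uint64_t birth);                    /* assumed hash function */

    /* Total order on (dva, birth, spa); avl.h wants -1/0/+1. */
    static int
    buf_hash_compare(const void *a, const void *b)
    {
            const arc_buf_hdr_t *ha = a, *hb = b;
            int c = memcmp(&ha->b_dva, &hb->b_dva, sizeof (dva_t));

            if (c != 0)
                    return (c > 0 ? 1 : -1);
            if (ha->b_birth != hb->b_birth)
                    return (ha->b_birth > hb->b_birth ? 1 : -1);
            if (ha->b_spa != hb->b_spa)
                    return (ha->b_spa > hb->b_spa ? 1 : -1);
            return (0);
    }

    /* O(log N) descent of the bucket's tree, not an O(N) chain walk. */
    static arc_buf_hdr_t *
    buf_hash_find(buf_hash_table_t *ht, uint64_t spa, const dva_t *dva,
        uint64_t birth)
    {
            arc_buf_hdr_t search;
            uint64_t idx = buf_hash(spa, dva, birth) & ht->ht_mask;

            search.b_spa = spa;
            search.b_dva = *dva;
            search.b_birth = birth;
            return (avl_find(&ht->ht_bucket[idx], &search, NULL));
    }

Each bucket would be set up once at boot with avl_create(&ht->ht_bucket[i],
buf_hash_compare, sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t,
b_hash_node)), and an insert becomes avl_add() after a failed avl_find().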
