On 10/16/13 6:27 PM, Prakash Surya wrote:
> On Wed, Oct 16, 2013 at 02:57:08AM +0100, Saso Kiselkov wrote:
>> On 10/16/13 2:42 AM, Prakash Surya wrote:
>>> OK, that is where I assumed the speedup was coming from (shorter
>>> chains leading to faster lookups).
>>>
>>> I also assumed there would be a "bucket lock" that needs to be
>>> acquired during this lookup, similar to the dbuf hash (which would
>>> affect concurrency), but I guess that isn't the case here (I haven't
>>> studied the arc hash code as closely as I have the dbuf hash code).
>>
>> It's mostly the same. Dbuf would also benefit a bit from
>> restructuring, perhaps.
>>
>>> So, if this simply comes down to a hash collision issue, can't we
>>> take this a bit further? Can we make the hash size completely
>>> dynamic? Instead of using another heuristic, can we grow (and
>>> shrink?) the hash as the workload demands? So if the max chain depth
>>> reaches a threshold, we increase the number of hash buckets (e.g.
>>> double it).
>>>
>>> Of course the details of performing the grow (i.e. rehash) operation
>>> would need to be worked out so it doesn't affect performance,
>>> consistency, etc. But from a theoretical standpoint, sizing it
>>> dynamically seems like a much better solution, IMO.
>>
>> Several problems with this approach:
>>
>> 1) unpredictability - when do you trigger it and by what criteria?
>
> This would need to be ironed out, but I was thinking that if the max
> chain length grows above X, then grow the hash. The value of X could
> be a tunable with some default value determined through testing
> certain workloads.
>
>> 2) the extreme expense of rehashing everything (easily dozens of
>>    seconds of one CPU pegged at 100% while everything else grinds
>>    to a halt as the ARC is inaccessible)
>
> Without doing anything clever, yes. You would have to lock the entire
> hash table, allocate the new table, and then move everything over.
>
> I was hoping there might be a clever way to do this in the background,
> so as not to affect performance "too much".
>
> Thinking out loud here, but maybe something incremental. For example,
> a background thread rehashes things one bucket at a time, and uses an
> index to indicate whether a bucket is valid or not. So:
>
>     read -> hash to index i of old/current table
>     if (bg_thread_is_rehashing && i < rehash_index)
>         rehash buffer to get index j of new table
>         look for buffer in new table using index j
>     else
>         look for buffer in old/current table using index i
>
> And the background thread would need to lock each bucket, rehash its
> buffers into the new table, increment the "rehash_index", unlock the
> bucket, and then repeat until all buckets are rehashed.
>
> Granted, I haven't given this much thought, but if it's possible to do
> in an unobtrusive way, I think it's worth pursuing.
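
For concreteness, the lookup path you're describing would have to look
roughly like this - a quick userland sketch only, all of the names
(old_table, rehashing, rehash_index, hdr_t, ...) are made up, and it
glosses over at least one race (see the comment):

/*
 * Sketch of the incremental-rehash lookup described above.  Not the
 * actual ARC code; the key doubles as its own hash value for brevity.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct hdr {
    uint64_t    h_key;      /* stands in for the spa/DVA/birth identity */
    struct hdr  *h_next;    /* hash chain link */
} hdr_t;

typedef struct hash_table {
    uint64_t        ht_mask;    /* nbuckets - 1, power-of-two sized */
    hdr_t           **ht_buckets;
    pthread_mutex_t *ht_locks;  /* one lock per bucket, for simplicity */
} hash_table_t;

static hash_table_t old_table, new_table;
static volatile bool rehashing;         /* set while the migrator runs */
static volatile uint64_t rehash_index;  /* old buckets < this are migrated */

static hdr_t *
chain_lookup(hash_table_t *ht, uint64_t idx, uint64_t key)
{
    hdr_t *h;

    pthread_mutex_lock(&ht->ht_locks[idx]);
    for (h = ht->ht_buckets[idx]; h != NULL; h = h->h_next) {
        if (h->h_key == key)
            break;
    }
    pthread_mutex_unlock(&ht->ht_locks[idx]);
    return (h);
}

static hdr_t *
buf_lookup(uint64_t key)
{
    uint64_t i = key & old_table.ht_mask;

    /*
     * If the migrator has already moved bucket i, the buffer now lives
     * in the new table.  (A real implementation would have to re-check
     * rehash_index under the bucket lock, with proper memory ordering,
     * to close the race with the migrator advancing between this test
     * and the lock acquisition.)
     */
    if (rehashing && i < rehash_index)
        return (chain_lookup(&new_table, key & new_table.ht_mask, key));

    return (chain_lookup(&old_table, i, key));
}

The background thread would then take each old-table bucket lock in
turn, move its headers into the new table, bump rehash_index, drop the
lock and move on - exactly the loop you outline.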
Have you measured the gains from this? Seriously, what you're proposing
is a hugely complicated intrusion into how the ARC hash tables work,
for very questionable gains. The potential pitfalls of auto-tuning are
even harder to predict for various workloads.

>> The easiest approach is just to give the admin a knob which they can
>> twiddle at boot and then leave alone. If performance is really a
>> problem, they can schedule downtime. Doing it at runtime has its
>> problems and doing it automatically is dangerous.
>
> I agree that the easiest approach is to just statically size the hash
> table, but that doesn't really solve the problem. Rather, we're just
> kicking the can down the road, IMO.

We're *NOT* kicking the can down the road. The static hash table has
known, fixed scaling properties, and none of them go off the rails with
hardware growth.

> If the completely dynamic approach isn't tractable, why split the
> table into a 2D array? Why not just increase the size of it, and keep
> it a 1D array?

The reason I split it was to reduce the size of any single allocation.
The point is that on bigmem machines (with 1TB+ of physical DRAM) these
allocations can grow to ridiculous sizes (1GB+). If the virtual address
space is sufficiently fragmented, this can create trouble. So far, at
least, this is my hypothesis. If it's invalid, I will happily revert
the code back to a 1D table, but so far we haven't been able to get VM
experts to comment on this. I have seen one edge case where the kernel,
under memory pressure, failed to allocate a contiguous 128k buffer -
whether that scenario applies generally, though, I'm not certain.
Again, not a VM expert.
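
To make concrete what the 2D split buys: the bucket array is carved
into fixed-size chunks, so the only contiguous allocation that grows
with memory is the small array of chunk pointers. A simplified userland
sketch - the names and the 64K chunk size are made up, this is not the
actual patch:

/*
 * One large logical bucket array backed by many small allocations.
 * Error-path cleanup omitted; assumes nbuckets is a power of two and
 * at least HT_CHUNK_NBUCKETS.
 */
#include <stdint.h>
#include <stdlib.h>

#define HT_CHUNK_SHIFT      13                          /* 8192 buckets/chunk */
#define HT_CHUNK_NBUCKETS   (1ULL << HT_CHUNK_SHIFT)
#define HT_CHUNK_MASK       (HT_CHUNK_NBUCKETS - 1)

typedef struct hash_table_2d {
    uint64_t    ht_mask;        /* total nbuckets - 1 */
    void        ***ht_chunks;   /* outer array: one pointer per chunk */
} hash_table_2d_t;

static int
ht_init(hash_table_2d_t *ht, uint64_t nbuckets)
{
    uint64_t nchunks = nbuckets >> HT_CHUNK_SHIFT;

    ht->ht_mask = nbuckets - 1;
    ht->ht_chunks = calloc(nchunks, sizeof (void **));
    if (ht->ht_chunks == NULL)
        return (-1);
    for (uint64_t c = 0; c < nchunks; c++) {
        /* Each inner allocation stays at 8192 * 8 = 64K on LP64. */
        ht->ht_chunks[c] = calloc(HT_CHUNK_NBUCKETS, sizeof (void *));
        if (ht->ht_chunks[c] == NULL)
            return (-1);
    }
    return (0);
}

static void **
ht_bucket(hash_table_2d_t *ht, uint64_t hv)
{
    uint64_t idx = hv & ht->ht_mask;

    return (&ht->ht_chunks[idx >> HT_CHUNK_SHIFT][idx & HT_CHUNK_MASK]);
}

A flat 1D table, by contrast, is where those 1GB+ contiguous
allocations on 1TB-class machines come from, and that is exactly what
I'm trying to avoid.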
Cheers,
--
Saso

_______________________________________________
developer mailing list
[email protected]
http://lists.open-zfs.org/mailman/listinfo/developer