On Wed, Oct 16, 2013 at 12:57:18PM -0700, Prakash Surya wrote:
> On Wed, Oct 16, 2013 at 06:40:12PM +0100, Saso Kiselkov wrote:
> > On 10/16/13 6:27 PM, Prakash Surya wrote:
> > > On Wed, Oct 16, 2013 at 02:57:08AM +0100, Saso Kiselkov wrote:
> > >> On 10/16/13 2:42 AM, Prakash Surya wrote:
> > >>> OK, that is where I assumed the speed up was coming from (shorter
> > >>> chains leading to faster lookups).
> > >>>
> > >>> I also assumed there would be a "bucket lock" that needs to be acquired
> > >>> during this lookup similar to the dbuf hash (which would affect
> > >>> concurrency), but I guess that isn't the case here (I haven't studied
> > >>> the arc hash code as well as I have the dbuf hash code).
> > >>
> > >> It's mostly the same. Dbuf would also benefit a bit from restructuring,
> > >> perhaps.
> > >>
> > >>> So, if this simply comes down to a hash collision issue, can't we try
> > >>> and take this a bit further? Can we make the hash size be completely
> > >>> dynamic? Instead of using another heuristic, can we grow (and shrink?)
> > >>> the hash as the workload demands? So if the max chain depth reaches a
> > >>> threshold, we increase the number of hash buckets (e.g. double it).
> > >>>
> > >>> Of course the details of performing the grow (i.e. rehash) operation
> > >>> would need to be worked out so it doesn't affect performance,
> > >>> consistency, etc. But from a theoretical standpoint, moving it to be
> > >>> sized dynamically seems like a much better solution, IMO.
> > >>
> > >> Several problems with this approach:
> > >>
> > >> 1) unpredictability - when to trigger it and by what criteria?
> > >
> > > This would need to be ironed out, but I was thinking that if the max
> > > chain length grows above X, then grow the hash. What the value of X is
> > > could be a tunable with some default value that is determined through
> > > testing certain workloads.
> > >
> > >> 2) the extreme expense of rehashing everything (easily dozens of
> > >> seconds of one CPU pegged at 100% while everything else grinds
> > >> to a halt as the ARC is inaccessible)
> > >
> > > Without doing anything clever, yes. You would have to lock the entire
> > > hash table, allocate the new table, and then move everything over.
> > >
> > > I was hoping there might be a clever way to do this in the background,
> > > so as not to affect performance "too much".
> > >
> > > Thinking out loud here, but maybe something incremental. For example, a
> > > background thread rehashes things one bucket at a time, and uses an index
> > > to indicate if a bucket is valid or not. So:
> > >     read -> hash to index i of old/current table
> > >     if (bg_thread_is_rehashing && i < rehash_index)
> > >         rehash buffer to get index j of new table
> > >         look for buffer in new table using index j
> > >     else
> > >         look for buffer in old/current table using index i
> > > And the background thread will need to lock each bucket, rehash the
> > > buffers into the new table, increment the "rehash_index", unlock the
> > > bucket, and then repeat until all buckets are rehashed.
> > >
> > > Granted, I haven't given this much thought, but if it's possible to do
> > > in an unobtrusive way, I think it's worth pursuing.
> >
> > Have you measured the gains from this? Seriously, what you're proposing
> > is a hugely complicated incision into how ARC hash tables work for very
> > questionable gains. The potential pitfalls of auto-tuning are even
> > harder to predict for various workloads.
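
To make the incremental-rehash lookup described in the quoted exchange above
a bit more concrete, here is a minimal C sketch. Every name in it
(arc_rehash_state_t, old_table, new_table, rehash_index, bg_thread rehashing
flag, and so on) is hypothetical rather than existing ARC code, and the
per-bucket locking needed to synchronize lookups with the background
migration thread is deliberately left out:

/*
 * Minimal sketch of the incremental-rehash lookup described in the quoted
 * text above. Every name here is hypothetical (this is not existing ARC
 * code), and the per-bucket locking needed to synchronize with the
 * background migration thread is omitted for brevity.
 */
#include <stdbool.h>
#include <stdint.h>

typedef struct arc_buf_hdr arc_buf_hdr_t;  /* hash chain element (opaque) */

typedef struct hash_table {
    arc_buf_hdr_t **buckets;   /* array of chain heads */
    uint64_t nbuckets;         /* always a power of two */
} hash_table_t;

typedef struct arc_rehash_state {
    hash_table_t *old_table;
    hash_table_t *new_table;   /* valid only while a rehash is in progress */
    uint64_t rehash_index;     /* old buckets below this index are migrated */
    bool rehashing;
} arc_rehash_state_t;

/* hash of the buffer identity and a walk of one chain; details omitted */
extern uint64_t buf_hash(const void *key);
extern arc_buf_hdr_t *bucket_search(hash_table_t *ht, uint64_t idx,
    const void *key);

static arc_buf_hdr_t *
arc_hash_lookup(arc_rehash_state_t *rs, const void *key)
{
    uint64_t h = buf_hash(key);
    uint64_t i = h & (rs->old_table->nbuckets - 1);

    /*
     * If a background rehash is running and this old bucket has already
     * been migrated, the buffer (if cached) now lives in the new, larger
     * table; otherwise it is still reachable through the old table.
     */
    if (rs->rehashing && i < rs->rehash_index) {
        uint64_t j = h & (rs->new_table->nbuckets - 1);
        return bucket_search(rs->new_table, j, key);
    }
    return bucket_search(rs->old_table, i, key);
}

A lookup racing with the migration of its own bucket would still have to
take the same per-bucket lock as the background thread, which is where much
of the complexity being debated above comes from.
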
>
> No, I haven't tested any of what I was proposing. I just thought I'd
> bring the idea up and see if it would survive.
>
> >
> > >> The easiest approach is just to give the admin a knob which they can
> > >> twiddle at boot and then leave alone. If performance is really a
> > >> problem, they can schedule downtime. Doing it at runtime has its
> > >> problems and doing it automatically is dangerous.
> > >
> > > I agree that the easiest approach is to just statically size the hash
> > > table, but that doesn't really solve the problem. Rather we're just
> > > kicking the can down the road, IMO.
> >
> > We're *NOT* kicking the can down the road. The static hash table has
> > known fixed scaling properties and none of them go off the rails with
> > hardware growth.
>
> Well, we're just increasing the size of the hash. So, given enough RAM,
> the problem of having enough buffers in the cache to cause large chains
> still exists. Right? Although, "enough RAM" might not be practically
> achievable.
Actually, I think I've convinced myself I was wrong here. It should scale
well with increased amounts of RAM, due to the table's size being
proportional to the RAM size. It's using more (and smaller) buffers with
the same amount of RAM that could still cause worse performance.

--
Cheers, Prakash

> > >
> > > If the completely dynamic approach isn't tractable, why split the table
> > > into a 2D array? Why not just increase the size of it, and keep it a 1D
> > > array?
> >
> > The reason why I split it was to reduce a single allocation size. The
> > point is that on bigmem machines (with 1TB+ of physical DRAM) these
> > allocations can grow to ridiculous sizes (1GB+). If the virtual address
> > space is sufficiently fragmented, this can create trouble. So far, at
> > least, this is my hypothesis. If it's invalid, I will happily revert the
> > code back to a 1D table, but so far we haven't been able to get VM
> > experts to comment on this. I have seen one edge case where the kernel,
> > under memory pressure, failed to allocate a contiguous 128k buffer -
> > whether the scenario applies generally, though, I'm not certain. Again,
> > not a VM expert.
>
> I'm no VM expert either, but speaking from Linux's perspective, I don't
> think virtual address space fragmentation is much of an issue. AFAIK,
> whether you're doing a 1M vmalloc or a 1G vmalloc, VM fragmentation
> doesn't really come into play. The kernel will allocate non-contiguous
> pages and then present them as a contiguous region, so you just need
> enough free pages on the system to satisfy the request.
>
> I should try and prod behlendorf about this, since he has much more
> experience on the subject than I do.
>
> --
> Cheers, Prakash
>
> >
> > Cheers,
> > --
> > Saso
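
To illustrate the two points in the exchange above (a bucket count that
scales with physical memory, and a 2D split of the bucket array so that no
single allocation reaches gigabyte size on bigmem machines), here is a rough
userland-style C sketch. The names, the 8K average block size, and the
chunking constants are assumptions for illustration only, not the code under
discussion, and a kernel would use its own allocator rather than calloc():

/*
 * Rough sketch of a bucket array that is (a) sized from the amount of
 * physical memory and (b) split into fixed-size chunks so that no single
 * allocation becomes enormous on bigmem machines. The constants and names
 * are illustrative assumptions, not the actual ARC implementation.
 */
#include <stdint.h>
#include <stdlib.h>

#define AVG_BLOCKSIZE   8192            /* assumed average cached block size */
#define CHUNK_SHIFT     16              /* 64K bucket pointers per chunk */
#define CHUNK_NBUCKETS  (1ULL << CHUNK_SHIFT)

typedef struct arc_buf_hdr arc_buf_hdr_t;

typedef struct buf_hash_table {
    uint64_t ht_mask;            /* nbuckets - 1 */
    arc_buf_hdr_t ***ht_chunks;  /* 2D: chunk -> bucket -> chain head */
} buf_hash_table_t;

static buf_hash_table_t buf_hash_table;

static void
buf_hash_init(uint64_t physmem_bytes)
{
    /*
     * Pick enough buckets that chains stay short even if the cache fills
     * all of physical memory with AVG_BLOCKSIZE-sized buffers; more RAM
     * therefore automatically means more buckets.
     */
    uint64_t nbuckets = 1ULL << 12;
    while (nbuckets * AVG_BLOCKSIZE < physmem_bytes)
        nbuckets <<= 1;

    uint64_t nchunks = nbuckets >> CHUNK_SHIFT;
    if (nchunks == 0) {
        nchunks = 1;
        nbuckets = CHUNK_NBUCKETS;
    }

    buf_hash_table.ht_mask = nbuckets - 1;
    buf_hash_table.ht_chunks = calloc(nchunks, sizeof (arc_buf_hdr_t **));
    for (uint64_t c = 0; c < nchunks; c++)
        buf_hash_table.ht_chunks[c] =
            calloc(CHUNK_NBUCKETS, sizeof (arc_buf_hdr_t *));
}

/* Map a hash value to its chain head without one monolithic allocation. */
static arc_buf_hdr_t **
buf_hash_bucket(uint64_t h)
{
    uint64_t idx = h & buf_hash_table.ht_mask;

    return &buf_hash_table.ht_chunks[idx >> CHUNK_SHIFT]
        [idx & (CHUNK_NBUCKETS - 1)];
}

With 1TB of RAM this works out to 2^27 buckets, i.e. 1GB of bucket pointers
in total, but spread across 2048 separate 512KB chunks instead of a single
1GB allocation.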
