On Wed, Oct 16, 2013 at 06:40:12PM +0100, Saso Kiselkov wrote:
> On 10/16/13 6:27 PM, Prakash Surya wrote:
> > On Wed, Oct 16, 2013 at 02:57:08AM +0100, Saso Kiselkov wrote:
> >> On 10/16/13 2:42 AM, Prakash Surya wrote:
> >>> OK, that is where I assumed the speed up was coming from (shorter
> >>> chains leading to faster lookups).
> >>>
> >>> I also assumed there would be a "bucket lock" that needs to be
> >>> acquired during this lookup, similar to the dbuf hash (which would
> >>> affect concurrency), but I guess that isn't the case here (I
> >>> haven't studied the arc hash code as closely as I have the dbuf
> >>> hash code).
> >> It's mostly the same. Dbuf would also benefit a bit from
> >> restructuring, perhaps.
> >>
> >>> So, if this simply comes down to a hash collision issue, can't we
> >>> try to take this a bit further? Can we make the hash size
> >>> completely dynamic? Instead of using another heuristic, can we
> >>> grow (and shrink?) the hash as the workload demands? So if the max
> >>> chain depth reaches a threshold, we increase the number of hash
> >>> buckets (e.g. double it).
> >>>
> >>> Of course, the details of performing the grow (i.e. rehash)
> >>> operation would need to be worked out so it doesn't affect
> >>> performance, consistency, etc. But from a theoretical standpoint,
> >>> sizing it dynamically seems like a much better solution, IMO.
> >>
> >> Several problems with this approach:
> >>
> >> 1) unpredictability - when do we trigger it, and by what criteria?
> >
> > This would need to be ironed out, but I was thinking that if the max
> > chain length grows above X, then we grow the hash. The value of X
> > could be a tunable, with a default determined through testing on
> > certain workloads.
> >
> >> 2) the extreme expense of rehashing everything (easily dozens of
> >>    seconds of one CPU pegged at 100% while everything else grinds
> >>    to a halt as the ARC is inaccessible)
> >
> > Without doing anything clever, yes. You would have to lock the
> > entire hash table, allocate the new table, and then move everything
> > over.
> >
> > I was hoping there might be a clever way to do this in the
> > background, so as not to affect performance "too much".
> >
> > Thinking out loud here, but maybe something incremental. For
> > example, a background thread rehashes things one bucket at a time,
> > and uses an index to indicate whether a bucket has been migrated
> > yet. So:
> >
> >     read -> hash to index i of old/current table
> >     if (bg_thread_is_rehashing && i < rehash_index)
> >         rehash buffer to get index j of new table
> >         look for buffer in new table using index j
> >     else
> >         look for buffer in old/current table using index i
> >
> > And the background thread will need to lock each bucket, rehash its
> > buffers into the new table, increment the "rehash_index", unlock the
> > bucket, and then repeat until all buckets are rehashed.
> >
> > Granted, I haven't given this much thought, but if it's possible to
> > do in an unobtrusive way, I think it's worth pursuing.
> Have you measured the gains from this? Seriously, what you're
> proposing is a hugely complicated incision into how ARC hash tables
> work for very questionable gains. The potential pitfalls of
> auto-tuning are even harder to predict for various workloads.

No, I haven't tested any of what I was proposing. I just thought I'd
bring the idea up and see if it would survive.
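To make the idea a bit more concrete, though, here is a rough C sketch
of the lookup path and the migration loop from that pseudocode. It is
purely illustrative: none of these names exist in the actual ARC code,
the tables and locking are simplified, and the race between sampling
rehash_index and taking the bucket lock is glossed over.

/*
 * Hypothetical sketch only; none of these names exist in the actual
 * ARC code. illumos kernel primitives (kmutex_t, mutex_enter) are
 * assumed.
 */
typedef struct buf {
        uint64_t        b_hash;         /* precomputed hash value */
        struct buf      *b_next;        /* bucket chain linkage */
} buf_t;

typedef struct hash_table {
        buf_t           **ht_buckets;   /* bucket chain heads */
        uint64_t        ht_nbuckets;    /* always a power of two */
        kmutex_t        *ht_locks;      /* one lock per bucket */
} hash_table_t;

static hash_table_t old_table, new_table;
static volatile boolean_t bg_thread_is_rehashing;
static volatile uint64_t rehash_index; /* buckets < this are migrated */

static buf_t *
hash_lookup(uint64_t hv)
{
        hash_table_t *t = &old_table;
        uint64_t idx = hv & (old_table.ht_nbuckets - 1);
        buf_t *b;

        /* Buckets below rehash_index already live in the new table. */
        if (bg_thread_is_rehashing && idx < rehash_index) {
                t = &new_table;
                idx = hv & (new_table.ht_nbuckets - 1);
        }

        mutex_enter(&t->ht_locks[idx]);
        for (b = t->ht_buckets[idx]; b != NULL; b = b->b_next) {
                if (b->b_hash == hv)
                        break;
        }
        mutex_exit(&t->ht_locks[idx]);
        return (b);
}

/* Background thread: drain one old bucket at a time into the new table. */
static void
rehash_thread(void)
{
        for (uint64_t i = 0; i < old_table.ht_nbuckets; i++) {
                mutex_enter(&old_table.ht_locks[i]);
                while (old_table.ht_buckets[i] != NULL) {
                        buf_t *b = old_table.ht_buckets[i];
                        uint64_t j = b->b_hash &
                            (new_table.ht_nbuckets - 1);

                        old_table.ht_buckets[i] = b->b_next;
                        mutex_enter(&new_table.ht_locks[j]);
                        b->b_next = new_table.ht_buckets[j];
                        new_table.ht_buckets[j] = b;
                        mutex_exit(&new_table.ht_locks[j]);
                }
                rehash_index = i + 1;
                mutex_exit(&old_table.ht_locks[i]);
        }
        bg_thread_is_rehashing = B_FALSE;
}

Even this toy version shows where the trouble hides: a reader can
sample rehash_index, choose the old table, and then race with the
migration of that very bucket, so a real implementation would need a
retry or a stricter ordering between the index update and the bucket
locks.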
>
> >> The easiest approach is just to give the admin a knob which they
> >> can twiddle at boot and then leave alone. If performance is really
> >> a problem, they can schedule downtime. Doing it at runtime has its
> >> problems, and doing it automatically is dangerous.
> >
> > I agree that the easiest approach is to just statically size the
> > hash table, but that doesn't really solve the problem. Rather,
> > we're just kicking the can down the road, IMO.
>
> We're *NOT* kicking the can down the road. The static hash table has
> known fixed scaling properties, and none of them go off the rails
> with hardware growth.

Well, we're just increasing the size of the hash. So, given enough
RAM, the problem of having enough buffers in the cache to cause long
chains still exists, right? Although "enough RAM" might not be
practically achievable.

>
> > If the completely dynamic approach isn't tractable, why split the
> > table into a 2D array? Why not just increase its size and keep it a
> > 1D array?
>
> The reason I split it was to reduce the size of any single
> allocation. The point is that on bigmem machines (with 1TB+ of
> physical DRAM) these allocations can grow to ridiculous sizes (1GB+).
> If the virtual address space is sufficiently fragmented, this can
> create trouble. So far, at least, this is my hypothesis. If it's
> invalid, I will happily revert the code back to a 1D table, but so
> far we haven't been able to get VM experts to comment on this. I have
> seen one edge case where the kernel, under memory pressure, failed to
> allocate a contiguous 128k buffer - whether the scenario applies
> generally, though, I'm not certain. Again, not a VM expert.
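If I'm picturing the split correctly, it's roughly the shape below: an
array of chunk pointers instead of one huge flat array, so no single
allocation exceeds a fixed chunk size. This is only an illustration of
the idea, not the actual patch; the names, the 128k chunk size, and
the buf_t type (from the earlier sketch) are all made up.

#define CHUNK_BYTES             (128 * 1024)
#define BUCKETS_PER_CHUNK       (CHUNK_BYTES / sizeof (buf_t *))

typedef struct chunked_table {
        buf_t           ***ct_chunks;   /* array of chunk pointers */
        uint64_t        ct_nbuckets;    /* multiple of BUCKETS_PER_CHUNK */
} chunked_table_t;

/* Many modest allocations instead of one multi-gigabyte one. */
static void
chunked_table_init(chunked_table_t *ct, uint64_t nbuckets)
{
        uint64_t nchunks = nbuckets / BUCKETS_PER_CHUNK;

        ct->ct_nbuckets = nbuckets;
        ct->ct_chunks = kmem_zalloc(nchunks * sizeof (buf_t **),
            KM_SLEEP);
        for (uint64_t c = 0; c < nchunks; c++)
                ct->ct_chunks[c] = kmem_zalloc(CHUNK_BYTES, KM_SLEEP);
}

/* Bucket access costs one extra pointer indirection. */
static buf_t **
bucket_ptr(chunked_table_t *ct, uint64_t idx)
{
        return (&ct->ct_chunks[idx / BUCKETS_PER_CHUNK]
            [idx % BUCKETS_PER_CHUNK]);
}

Since BUCKETS_PER_CHUNK comes out to a power of two on a 64-bit
system, the divide and modulo reduce to a shift and a mask, so the
extra indirection should be about the only cost.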
I'm no VM expert either, but speaking from Linux's perspective, I
don't think virtual address space fragmentation is much of an issue.
AFAIK, whether you're doing a 1M vmalloc or a 1G vmalloc,
fragmentation doesn't play much of a role. The kernel will allocate
non-contiguous physical pages and present them as a contiguous virtual
region, so you just need enough free pages on the system to satisfy
the request.

I should try and prod behlendorf about this, since he has much more
experience on the subject than I do.

--
Cheers,
Prakash

>
> Cheers,
> --
> Saso