On Wed, Oct 16, 2013 at 12:57:18PM -0700, Prakash Surya wrote:
> On Wed, Oct 16, 2013 at 06:40:12PM +0100, Saso Kiselkov wrote:
> > On 10/16/13 6:27 PM, Prakash Surya wrote:
> > > On Wed, Oct 16, 2013 at 02:57:08AM +0100, Saso Kiselkov wrote:
> > >> On 10/16/13 2:42 AM, Prakash Surya wrote:
> > >>> OK, that is where I assumed the speed up was coming from (shorter
> > >>> chains leading to faster lookups).
> > >>>
> > >>> I also assumed there would be a "bucket lock" that needs to be acquired
> > >>> during this lookup similar to the dbuf hash (which would affect
> > >>> concurrency), but I guess that isn't the case here (I haven't studied
> > >>> the arc hash code as well as I have the dbuf hash code).
> > >>
> > >> It's mostly the same. Dbuf would also benefit a bit from restructuring,
> > >> perhaps.
> > >>
> > >>> So, if this simply comes down to a hash collision issue, can't we try
> > >>> and take this a bit further? Can we make the hash size be completely
> > >>> dynamic? Instead of using another heuristic, can we grow (and shrink?)
> > >>> the hash as the workload demands? So if the max chain depth reaches a
> > >>> threshold, we increase the number of hash buckets (e.g. double it).
> > >>>
> > >>> Of course the details of performing the grow (i.e. rehash) operation
> > >>> would need to be worked out so it doesn't affect performance,
> > >>> consistency, etc. But from a theoretical standpoint, moving it to be
> > >>> sized dynamically seems like a much better solution, IMO.
> > >>
> > >> Several problems with this approach:
> > >>
> > >> 1) unpredictability - when to trigger it and by what criteria?
> > >
> > > This would need to be ironed out, but I was thinking that if the max
> > > chain length grows above X, then grow the hash. What the value of X is
> > > could be a tunable with some default value that is determined through
> > > testing certain workloads.
> > >
> > >> 2) the extreme expense of rehashing everything (easily dozens of
> > >> seconds of one CPU pegged at 100% while everything else grinds
> > >> to a halt as the ARC is inaccessible)
> > >
> > > Without doing anything clever, yes. You would have to lock the entire
> > > hash table, allocate the new table, and then move everything over.
> > >
> > > I was hoping there might be a clever way to do this in the background,
> > > so as not to affect performance "too much".
> > >
> > > Thinking out loud here, but maybe something incremental. For example, a
> > > background thread rehashes things one bucket at a time, and uses an index
> > > to indicate if a bucket is valid or not. So:
> > >     read -> hash to index i of old/current table
> > >     if (bg_thread_is_rehashing && i < rehash_index)
> > >         rehash buffer to get index j of new table
> > >         look for buffer in new table using index j
> > >     else
> > >         look for buffer in old/current table using index i
> > > And the background thread will need to lock each bucket, rehash the
> > > buffers into the new table, increment the "rehash_index", unlock the
> > > bucket, and then repeat until all buckets are rehashed.
> > >
> > > Granted, I haven't given this much thought, but if it's possible to do
> > > in an unobtrusive way, I think it's worth pursuing.
> >
> > Have you measured the gains from this? Seriously, what you're proposing
> > is a hugely complicated incision into how ARC hash tables work for very
> > questionable gains. The potential pitfalls of auto-tuning are even
> > harder to predict for various workloads.
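
To make the incremental-rehash lookup described in the quoted exchange above
a bit more concrete, here is a minimal C sketch. Every name in it
(arc_rehash_state_t, old_table, new_table, rehash_index, bg_thread rehashing
flag, and so on) is hypothetical rather than existing ARC code, and the
per-bucket locking needed to synchronize lookups with the background
migration thread is deliberately left out:

/*
 * Minimal sketch of the incremental-rehash lookup described in the quoted
 * text above. Every name here is hypothetical (this is not existing ARC
 * code), and the per-bucket locking needed to synchronize with the
 * background migration thread is omitted for brevity.
 */
#include <stdbool.h>
#include <stdint.h>

typedef struct arc_buf_hdr arc_buf_hdr_t;  /* hash chain element (opaque) */

typedef struct hash_table {
    arc_buf_hdr_t **buckets;   /* array of chain heads */
    uint64_t nbuckets;         /* always a power of two */
} hash_table_t;

typedef struct arc_rehash_state {
    hash_table_t *old_table;
    hash_table_t *new_table;   /* valid only while a rehash is in progress */
    uint64_t rehash_index;     /* old buckets below this index are migrated */
    bool rehashing;
} arc_rehash_state_t;

/* hash of the buffer identity and a walk of one chain; details omitted */
extern uint64_t buf_hash(const void *key);
extern arc_buf_hdr_t *bucket_search(hash_table_t *ht, uint64_t idx,
    const void *key);

static arc_buf_hdr_t *
arc_hash_lookup(arc_rehash_state_t *rs, const void *key)
{
    uint64_t h = buf_hash(key);
    uint64_t i = h & (rs->old_table->nbuckets - 1);

    /*
     * If a background rehash is running and this old bucket has already
     * been migrated, the buffer (if cached) now lives in the new, larger
     * table; otherwise it is still reachable through the old table.
     */
    if (rs->rehashing && i < rs->rehash_index) {
        uint64_t j = h & (rs->new_table->nbuckets - 1);
        return bucket_search(rs->new_table, j, key);
    }
    return bucket_search(rs->old_table, i, key);
}

A lookup racing with the migration of its own bucket would still have to
take the same per-bucket lock as the background thread, which is where much
of the complexity being debated above comes from.
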
>
> No, I haven't tested any of what I was proposing. I just thought I'd
> bring the idea up and see if it would survive.
>
> >
> > >> The easiest approach is just to give the admin a knob which they can
> > >> twiddle at boot and then leave alone. If performance is really a
> > >> problem, they can schedule downtime. Doing it at runtime has its
> > >> problems and doing it automatically is dangerous.
> > >
> > > I agree that the easiest approach is to just statically size the hash
> > > table, but that doesn't really solve the problem. Rather we're just
> > > kicking the can down the road, IMO.
> >
> > We're *NOT* kicking the can down the road. The static hash table has
> > known fixed scaling properties and none of them go off the rails with
> > hardware growth.
>
> Well, we're just increasing the size of the hash. So, given enough RAM,
> the problem of having enough buffers in the cache to cause large chains
> still exists. Right? Although, "enough RAM" might not be practically
> achievable.
Actually, I think I've convinced myself I was wrong here. It should scale
well with increased amounts of RAM, due to the table's size being
proportional to the RAM size. It's using more (and smaller) buffers with
the same amount of RAM that could still cause worse performance.

--
Cheers, Prakash

> > >
> > > If the completely dynamic approach isn't tractable, why split the table
> > > into a 2D array? Why not just increase the size of it, and keep it a 1D
> > > array?
> >
> > The reason why I split it was to reduce a single allocation size. The
> > point is that on bigmem machines (with 1TB+ of physical DRAM) these
> > allocations can grow to ridiculous sizes (1GB+). If the virtual address
> > space is sufficiently fragmented, this can create trouble. So far, at
> > least, this is my hypothesis. If it's invalid, I will happily revert the
> > code back to a 1D table, but so far we haven't been able to get VM
> > experts to comment on this. I have seen one edge case where the kernel,
> > under memory pressure, failed to allocate a contiguous 128k buffer -
> > whether the scenario applies generally, though, I'm not certain. Again,
> > not a VM expert.
>
> I'm no VM expert either, but speaking from Linux's perspective, I don't
> think virtual address space fragmentation is much of an issue. AFAIK,
> whether you're doing a 1M vmalloc or a 1G vmalloc, VM fragmentation
> doesn't really come into play. The kernel will allocate non-contiguous
> pages and then present them as a contiguous region, so you just need
> enough free pages on the system to satisfy the request.
>
> I should try and prod behlendorf about this, since he has much more
> experience on the subject than I do.
>
> --
> Cheers, Prakash
>
> >
> > Cheers,
> > --
> > Saso
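
To illustrate the two points in the exchange above (a bucket count that
scales with physical memory, and a 2D split of the bucket array so that no
single allocation reaches gigabyte size on bigmem machines), here is a rough
userland-style C sketch. The names, the 8K average block size, and the
chunking constants are assumptions for illustration only, not the code under
discussion, and a kernel would use its own allocator rather than calloc():

/*
 * Rough sketch of a bucket array that is (a) sized from the amount of
 * physical memory and (b) split into fixed-size chunks so that no single
 * allocation becomes enormous on bigmem machines. The constants and names
 * are illustrative assumptions, not the actual ARC implementation.
 */
#include <stdint.h>
#include <stdlib.h>

#define AVG_BLOCKSIZE   8192            /* assumed average cached block size */
#define CHUNK_SHIFT     16              /* 64K bucket pointers per chunk */
#define CHUNK_NBUCKETS  (1ULL << CHUNK_SHIFT)

typedef struct arc_buf_hdr arc_buf_hdr_t;

typedef struct buf_hash_table {
    uint64_t ht_mask;            /* nbuckets - 1 */
    arc_buf_hdr_t ***ht_chunks;  /* 2D: chunk -> bucket -> chain head */
} buf_hash_table_t;

static buf_hash_table_t buf_hash_table;

static void
buf_hash_init(uint64_t physmem_bytes)
{
    /*
     * Pick enough buckets that chains stay short even if the cache fills
     * all of physical memory with AVG_BLOCKSIZE-sized buffers; more RAM
     * therefore automatically means more buckets.
     */
    uint64_t nbuckets = 1ULL << 12;
    while (nbuckets * AVG_BLOCKSIZE < physmem_bytes)
        nbuckets <<= 1;

    uint64_t nchunks = nbuckets >> CHUNK_SHIFT;
    if (nchunks == 0) {
        nchunks = 1;
        nbuckets = CHUNK_NBUCKETS;
    }

    buf_hash_table.ht_mask = nbuckets - 1;
    buf_hash_table.ht_chunks = calloc(nchunks, sizeof (arc_buf_hdr_t **));
    for (uint64_t c = 0; c < nchunks; c++)
        buf_hash_table.ht_chunks[c] =
            calloc(CHUNK_NBUCKETS, sizeof (arc_buf_hdr_t *));
}

/* Map a hash value to its chain head without one monolithic allocation. */
static arc_buf_hdr_t **
buf_hash_bucket(uint64_t h)
{
    uint64_t idx = h & buf_hash_table.ht_mask;

    return &buf_hash_table.ht_chunks[idx >> CHUNK_SHIFT]
        [idx & (CHUNK_NBUCKETS - 1)];
}

With 1TB of RAM this works out to 2^27 buckets, i.e. 1GB of bucket pointers
in total, but spread across 2048 separate 512KB chunks instead of a single
1GB allocation.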
