On 10/16/13 6:27 PM, Prakash Surya wrote:
> On Wed, Oct 16, 2013 at 02:57:08AM +0100, Saso Kiselkov wrote:
>> On 10/16/13 2:42 AM, Prakash Surya wrote:
>>> OK, that is where I assumed the speedup was coming from (shorter
>>> chains leading to faster lookups).
>>>
>>> I also assumed there would be a "bucket lock" that needs to be
>>> acquired during this lookup, similar to the dbuf hash (which would
>>> affect concurrency), but I guess that isn't the case here (I haven't
>>> studied the arc hash code as closely as I have the dbuf hash code).
>>
>> It's mostly the same. Dbuf would also benefit a bit from
>> restructuring, perhaps.
>>
>>> So, if this simply comes down to a hash collision issue, can't we
>>> take this a bit further? Can we make the hash size completely
>>> dynamic? Instead of using another heuristic, can we grow (and
>>> shrink?) the hash as the workload demands? So if the max chain depth
>>> reaches a threshold, we increase the number of hash buckets (e.g.
>>> double it).
>>>
>>> Of course the details of performing the grow (i.e. rehash) operation
>>> would need to be worked out so it doesn't affect performance,
>>> consistency, etc. But from a theoretical standpoint, sizing it
>>> dynamically seems like a much better solution, IMO.
>>
>> Several problems with this approach:
>>
>> 1) unpredictability - when do you trigger it and by what criteria?
>
> This would need to be ironed out, but I was thinking that if the max
> chain length grows above X, then grow the hash. The value of X could
> be a tunable with some default value determined through testing
> certain workloads.
>
>> 2) the extreme expense of rehashing everything (easily dozens of
>>    seconds of one CPU pegged at 100% while everything else grinds
>>    to a halt as the ARC is inaccessible)
>
> Without doing anything clever, yes. You would have to lock the entire
> hash table, allocate the new table, and then move everything over.
>
> I was hoping there might be a clever way to do this in the background,
> so as not to affect performance "too much".
>
> Thinking out loud here, but maybe something incremental. For example,
> a background thread rehashes things one bucket at a time, and uses an
> index to indicate whether a bucket is valid or not. So:
>
>     read -> hash to index i of old/current table
>     if (bg_thread_is_rehashing && i < rehash_index)
>         rehash buffer to get index j of new table
>         look for buffer in new table using index j
>     else
>         look for buffer in old/current table using index i
>
> And the background thread would need to lock each bucket, rehash its
> buffers into the new table, increment the "rehash_index", unlock the
> bucket, and then repeat until all buckets are rehashed.
>
> Granted, I haven't given this much thought, but if it's possible to do
> in an unobtrusive way, I think it's worth pursuing.
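
For concreteness, the lookup path you're describing would have to look
roughly like this - a quick userland sketch only, all of the names
(old_table, rehashing, rehash_index, hdr_t, ...) are made up, and it
glosses over at least one race (see the comment):

/*
 * Sketch of the incremental-rehash lookup described above.  Not the
 * actual ARC code; the key doubles as its own hash value for brevity.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct hdr {
    uint64_t    h_key;      /* stands in for the spa/DVA/birth identity */
    struct hdr  *h_next;    /* hash chain link */
} hdr_t;

typedef struct hash_table {
    uint64_t        ht_mask;    /* nbuckets - 1, power-of-two sized */
    hdr_t           **ht_buckets;
    pthread_mutex_t *ht_locks;  /* one lock per bucket, for simplicity */
} hash_table_t;

static hash_table_t old_table, new_table;
static volatile bool rehashing;         /* set while the migrator runs */
static volatile uint64_t rehash_index;  /* old buckets < this are migrated */

static hdr_t *
chain_lookup(hash_table_t *ht, uint64_t idx, uint64_t key)
{
    hdr_t *h;

    pthread_mutex_lock(&ht->ht_locks[idx]);
    for (h = ht->ht_buckets[idx]; h != NULL; h = h->h_next) {
        if (h->h_key == key)
            break;
    }
    pthread_mutex_unlock(&ht->ht_locks[idx]);
    return (h);
}

static hdr_t *
buf_lookup(uint64_t key)
{
    uint64_t i = key & old_table.ht_mask;

    /*
     * If the migrator has already moved bucket i, the buffer now lives
     * in the new table.  (A real implementation would have to re-check
     * rehash_index under the bucket lock, with proper memory ordering,
     * to close the race with the migrator advancing between this test
     * and the lock acquisition.)
     */
    if (rehashing && i < rehash_index)
        return (chain_lookup(&new_table, key & new_table.ht_mask, key));

    return (chain_lookup(&old_table, i, key));
}

The background thread would then take each old-table bucket lock in
turn, move its headers into the new table, bump rehash_index, drop the
lock and move on - exactly the loop you outline.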
Have you measured the gains from this? Seriously, what you're proposing
is a hugely complicated intrusion into how the ARC hash tables work,
for very questionable gains. The potential pitfalls of auto-tuning are
even harder to predict for various workloads.

>> The easiest approach is just to give the admin a knob which they can
>> twiddle at boot and then leave alone. If performance is really a
>> problem, they can schedule downtime. Doing it at runtime has its
>> problems and doing it automatically is dangerous.
>
> I agree that the easiest approach is to just statically size the hash
> table, but that doesn't really solve the problem. Rather, we're just
> kicking the can down the road, IMO.

We're *NOT* kicking the can down the road. The static hash table has
known, fixed scaling properties, and none of them go off the rails with
hardware growth.

> If the completely dynamic approach isn't tractable, why split the
> table into a 2D array? Why not just increase the size of it, and keep
> it a 1D array?

The reason I split it was to reduce the size of any single allocation.
The point is that on bigmem machines (with 1TB+ of physical DRAM) these
allocations can grow to ridiculous sizes (1GB+). If the virtual address
space is sufficiently fragmented, this can create trouble. So far, at
least, this is my hypothesis. If it's invalid, I will happily revert
the code back to a 1D table, but so far we haven't been able to get VM
experts to comment on this. I have seen one edge case where the kernel,
under memory pressure, failed to allocate a contiguous 128k buffer -
whether that scenario applies generally, though, I'm not certain.
Again, not a VM expert.
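
To make concrete what the 2D split buys: the bucket array is carved
into fixed-size chunks, so the only contiguous allocation that grows
with memory is the small array of chunk pointers. A simplified userland
sketch - the names and the 64K chunk size are made up, this is not the
actual patch:

/*
 * One large logical bucket array backed by many small allocations.
 * Error-path cleanup omitted; assumes nbuckets is a power of two and
 * at least HT_CHUNK_NBUCKETS.
 */
#include <stdint.h>
#include <stdlib.h>

#define HT_CHUNK_SHIFT      13                          /* 8192 buckets/chunk */
#define HT_CHUNK_NBUCKETS   (1ULL << HT_CHUNK_SHIFT)
#define HT_CHUNK_MASK       (HT_CHUNK_NBUCKETS - 1)

typedef struct hash_table_2d {
    uint64_t    ht_mask;        /* total nbuckets - 1 */
    void        ***ht_chunks;   /* outer array: one pointer per chunk */
} hash_table_2d_t;

static int
ht_init(hash_table_2d_t *ht, uint64_t nbuckets)
{
    uint64_t nchunks = nbuckets >> HT_CHUNK_SHIFT;

    ht->ht_mask = nbuckets - 1;
    ht->ht_chunks = calloc(nchunks, sizeof (void **));
    if (ht->ht_chunks == NULL)
        return (-1);
    for (uint64_t c = 0; c < nchunks; c++) {
        /* Each inner allocation stays at 8192 * 8 = 64K on LP64. */
        ht->ht_chunks[c] = calloc(HT_CHUNK_NBUCKETS, sizeof (void *));
        if (ht->ht_chunks[c] == NULL)
            return (-1);
    }
    return (0);
}

static void **
ht_bucket(hash_table_2d_t *ht, uint64_t hv)
{
    uint64_t idx = hv & ht->ht_mask;

    return (&ht->ht_chunks[idx >> HT_CHUNK_SHIFT][idx & HT_CHUNK_MASK]);
}

A flat 1D table, by contrast, is where those 1GB+ contiguous
allocations on 1TB-class machines come from, and that is exactly what
I'm trying to avoid.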
Cheers,
--
Saso

_______________________________________________
developer mailing list
[email protected]
http://lists.open-zfs.org/mailman/listinfo/developer