I agree that assuming a 64K average block size is a bad idea.  We should
assume a smaller average block size, and also allow it to be overridden
in /etc/system -- as you have done.

Are these changes tied to the persistent L2ARC work?  Could we integrate
the hash table sizing changes first and evaluate persistence separately?

> 3. What happens when the hash table size is excessively large, as happens
>    on bigmem systems?

Indeed, what happens?  I would guess that it works just fine, especially
given that this will be executed early in boot, when nearly all of the
address space is available.  Do you have evidence that kmem_alloc(1GB) does
not work?

--matt


On Sat, Oct 12, 2013 at 4:21 PM, Saso Kiselkov <[email protected]> wrote:

> While benchmarking the persistent L2ARC implementation I hit upon a
> "slight" snag with smaller block sizes and rebuild times. In essence, if
> you have many small blocks in your L2ARC (e.g. 8k due to hosting zvols),
> rebuild can take forever to complete (tens of seconds). The reason has
> to do with the ARC hash table implementation, so first a bit of background:
>
> The ARC holds blocks in a hash table. No surprises there. Aside from the
> obvious small room for improvement of marking the very hot hash table
> manipulation functions "inline", there are bigger problems that need
> to be addressed. First of all, how does one size the hash table? Small
> hash tables take up less space, but result in more collisions and
> longer bucket chains (which hurts performance). Larger hash tables have
> better performance but are (doh) larger. I even went and tried switching
> the implementation to AVL trees (moderate performance improvement in
> some places, moderate performance decrease in others).
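>
> (To illustrate why chain length matters: a lookup in a chained hash
> table walks the bucket's collision list, so every miss and every
> insert pays a cost proportional to the chain length. Below is a
> minimal sketch; ht_table[] and hdr_matches() are stand-in names, not
> the actual arc.c identifiers.)
>
>     static arc_buf_hdr_t *
>     ht_lookup(uint64_t idx, const dva_t *dva, uint64_t birth)
>     {
>             arc_buf_hdr_t *hdr;
>
>             /* Walk the collision chain; cost is O(chain length). */
>             for (hdr = ht_table[idx]; hdr != NULL; hdr = hdr->b_hash_next) {
>                     if (hdr_matches(hdr, dva, birth))
>                             return (hdr);
>             }
>             return (NULL);
>     }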
>
> The current implementation is hardcoded guesswork as to the correct hash
> table size. In essence, it takes the amount of physical memory and
> divides it by an assumed 64k average block size; the result is the
> number of hash buckets that will be created. On 64-bit machines this
> comes out to roughly 128kB of hash table for each 1 GB of physical
> memory (16384 buckets at 8 bytes per bucket pointer); a sketch of this
> heuristic follows the list below. This approach has obvious flaws:
>
> 1. What if we need to cache many smaller blocks? We'll get excessively
>    long bucket chains (see the hash_chain_max and hash_collisions kstats
>    on your systems).
>
> 2. What if we need to cache a lot more blocks than fit into main memory?
>    The L2ARC needs to keep its ARC headers around.
>
> 3. What happens when the hash table size is excessively large, as happens
>    on bigmem systems? E.g. on a system with 1TB of physical memory (which
>    can be easily built nowadays, and it's only going to get cheaper) the
>    hash table will be a contiguous chunk of memory some 128MB in size.
>    If we try to combat problems #1 & #2 by making the hash table larger,
>    we exacerbate problem #3.
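>
> (For reference, the existing heuristic boils down to roughly the
> following. This is a simplified sketch of the sizing logic in
> buf_init(), not a verbatim copy:)
>
>     /*
>      * One bucket per 64K of physical memory, rounded up to a power of
>      * two; with 8-byte bucket pointers this works out to ~128kB of
>      * hash table per 1GB of RAM.
>      */
>     uint64_t hsize = 1ULL << 12;
>
>     while (hsize * 65536 < (uint64_t)physmem * PAGESIZE)
>             hsize <<= 1;
>     buf_hash_table.ht_mask = hsize - 1;
>     buf_hash_table.ht_table =
>         kmem_zalloc(hsize * sizeof (void *), KM_SLEEP);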
>
> My solutions to each of these problems (implemented in the webrev for
> issue #3525) are:
>
> 1: Changed the sizing algorithm to use a kernel tunable. This allows
>    administrators with specific workloads to change the hash table size
>    according to their needs without having to build custom kernel
>    modules.
>
> 2: Increased the default hash table size to 1MB for every 1GB of main
>    memory. We can easily afford to lose 0.1% of memory capacity
>    considering the performance gains (outlined below), especially
>    since many systems today use the L2ARC as an integral part of their
>    storage infrastructure.
>
> 3: Switched to a 2D hash table (see the sketch after this list). This
>    means that the largest contiguous chunk of memory will be at most
>    sqrt(hash_table_size).
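>
>    (To make the 2D layout concrete, here is a sketch; ht_init() and
>    the other names are made up, not the webrev code, and nbuckets
>    would come from the tunable in point 1:)
>
>        /*
>         * Outer array of row pointers, each row a separately allocated
>         * run of bucket heads, so no single allocation is larger than
>         * roughly sqrt(nbuckets) * sizeof (pointer).
>         */
>        static arc_buf_hdr_t ***ht;
>        static uint64_t ht_nrows, ht_ncols;
>
>        static void
>        ht_init(uint64_t nbuckets)      /* power of two */
>        {
>                uint64_t r;
>
>                ht_ncols = 1;
>                while (ht_ncols * ht_ncols < nbuckets)
>                        ht_ncols <<= 1;
>                ht_nrows = nbuckets / ht_ncols;
>
>                ht = kmem_zalloc(ht_nrows * sizeof (arc_buf_hdr_t **),
>                    KM_SLEEP);
>                for (r = 0; r < ht_nrows; r++)
>                        ht[r] = kmem_zalloc(ht_ncols *
>                            sizeof (arc_buf_hdr_t *), KM_SLEEP);
>        }
>
>        /* A bucket then lives at ht[idx / ht_ncols][idx % ht_ncols]. */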
>
> Webrev: http://cr.illumos.org/~webrev/skiselkov/3525_simplified/
>
> Overall, the above changes boil down to the following hash table size
> numbers (from small to large systems):
>
> Physical RAM    HT Size         HT Dimensions
> -------------------------------------------------
> 4 GB            4 MB            2048x     4k cols
> 128 GB          128 MB          4096x    32k cols
> 1 TB            1 GB            32768x  256k cols
> 16 TB           16 GB           262144x   2M cols
>
> (For the actual algorithm see arc.c in the webrev.)
>
> The performance gains from this are pretty substantial in my testing.
> The time needed to rebuild 93 GB worth of ARC buffers consisting of a
> mixture of 128k and 8k blocks (total number of ARC buffers: 6.4 million,
> or ~15.4k average block size) decreased from ~50s to about 9.5s (with a
> small portion of that being physical L2ARC read latency).
>
> I apologize for this long litany of an email and look forward to your
> comments and reviews.
>
> Cheers,
> --
> Saso
>
>