While benchmarking the persistent L2ARC implementation I ran into a
"slight" snag with smaller block sizes and rebuild times. In essence, if
you have many small blocks in your L2ARC (e.g. 8k due to hosting zvols),
rebuild can take forever to complete (tens of seconds). The reason has
to do with the ARC hash table implementation, so first a bit of background:

The ARC holds blocks in a hash table. No surprises there. Aside from the
obvious small room for improvement of marking all the very hot hash
table manipulation functions "inline", there are bigger problems that
need to be addressed. First of all, how does one size the hash table?
Small hash tables take up less space, but result in more collisions and
longer bucket chains (which hurts performance). Larger hash tables
perform better but are (doh) larger. I even went and tried switching
the implementation to AVL trees (a moderate performance improvement in
some places, a moderate decrease in others).
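
To illustrate why chain length matters, here's a deliberately simplified
sketch of the kind of chained lookup the ARC does. The names and the
single-integer key are made up for the example (the real code in arc.c
hashes the spa/DVA/birth triple and takes a bucket lock), so read it as
a sketch, not as the actual implementation:

#include <stdint.h>
#include <stddef.h>

/*
 * Illustrative sketch of a chained hash table of the kind the ARC
 * uses.  All names here are made up for the example; the real thing
 * lives in arc.c.
 */
typedef struct hdr {
    uint64_t    hdr_key;    /* stands in for the spa/DVA/birth triple */
    struct hdr  *hdr_next;  /* next header on this bucket's chain */
} hdr_t;

typedef struct {
    hdr_t       **ht_table; /* one head pointer per bucket */
    uint64_t    ht_mask;    /* nbuckets - 1, nbuckets a power of two */
} ht_t;

static hdr_t *
ht_find(ht_t *ht, uint64_t key)
{
    /* Hash the key down to a bucket index. */
    uint64_t idx = (key * 0x9e3779b97f4a7c15ULL) & ht->ht_mask;

    /*
     * Walk the bucket's chain; the longer the chain, the more
     * compares every lookup (and insert) has to do.
     */
    for (hdr_t *hdr = ht->ht_table[idx]; hdr != NULL; hdr = hdr->hdr_next) {
        if (hdr->hdr_key == key)
            return (hdr);
    }
    return (NULL);
}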

The current implementation is hardcoded guesswork as to the correct hash
table size. In principle, it works by taking the amount of physical
memory and dividing it by an assumed average block size of 64k; the
result is the number of hash buckets that will be created. With 8-byte
bucket pointers on 64-bit machines this comes out to roughly 128kB of
hash table for each 1GB of physical memory (a rough sketch of this
logic follows the list below). This approach has obvious flaws:

1. What if we need to cache many smaller blocks? We'll get excessively
   long bucket chains (see the hash_chain_max and hash_collisions kstats
   on your systems).

2. What if we need to cache a lot more blocks than fit into main
   memory? The L2ARC needs to keep an ARC header around for every block
   it caches, so the hash table can hold far more entries than the data
   that fits in RAM alone would suggest.

3. What happens when the hash table size is excessively large, as
   happens on bigmem systems? E.g. on a system with 1TB of physical
   memory (which can be easily built nowadays, and it's only going to
   get cheaper) the hash table will be a contiguous chunk of memory
   some 128MB in size.

If we try to combat problems #1 & #2 by making the hash table larger,
we exacerbate problem #3.
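
For concreteness, the current sizing logic in buf_init() boils down to
roughly the following (paraphrased, not a verbatim copy -- the
allocation-retry path is omitted; see arc.c for the real thing):

uint64_t hsize = 1ULL << 12;            /* start at 4096 buckets */

/*
 * Enough buckets to fill all of physical memory with an average
 * 64K block size.
 */
while (hsize * 65536 < physmem * PAGESIZE)
        hsize <<= 1;

/*
 * One 8-byte bucket pointer per 64K of RAM works out to ~128kB of
 * table per GB of RAM -- and a single contiguous ~128MB allocation
 * on a 1TB machine.
 */
buf_hash_table.ht_mask = hsize - 1;
buf_hash_table.ht_table = kmem_zalloc(hsize * sizeof (void *), KM_NOSLEEP);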

My solutions to each of these problems (implemented in the webrev for
issue #3525) are:

1: Changed the sizing algorithm to use a kernel tunable. This allows
   administrators with specific workloads to change the hash table size
   according to their needs without having to build custom kernel
   modules.

2: Increased the hash table size to 1MB for every 1GB of main memory.
   We can easily afford to lose 0.1% of memory capacity considering the
   performance gains (outlined below), especially since many systems
   today use the L2ARC as an integral part of their storage
   infrastructure.

3: Switched to a 2D hash table, so that the largest contiguous chunk
   of memory is at most on the order of sqrt(hash_table_size) (see the
   sketch below).
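
To make point 3 a bit more concrete, below is a minimal userland sketch
of the 2D layout idea. The names, the power-of-two assumptions and the
split into rows and columns are mine for illustration only; the actual
code is in the webrev, and in the kernel the total bucket count would
come from the sizing logic (or the tunable from point 1) rather than
being passed in:

#include <stdint.h>
#include <stdlib.h>

typedef struct hdr hdr_t;       /* chained ARC-style header, as before */

typedef struct {
    hdr_t       ***ht_rows;     /* outer array: one pointer per row */
    uint64_t    ht_rowshift;    /* log2(buckets per row) */
    uint64_t    ht_rowmask;     /* buckets per row - 1 */
    uint64_t    ht_mask;        /* total buckets - 1 */
} ht2d_t;

/* Map a hash value to the address of its bucket head. */
static hdr_t **
ht2d_bucket(ht2d_t *ht, uint64_t hash)
{
    uint64_t idx = hash & ht->ht_mask;

    /* High bits pick the row, low bits the column within that row. */
    return (&ht->ht_rows[idx >> ht->ht_rowshift][idx & ht->ht_rowmask]);
}

/* nbuckets and buckets_per_row are assumed to be powers of two. */
static int
ht2d_create(ht2d_t *ht, uint64_t nbuckets, uint64_t buckets_per_row)
{
    uint64_t nrows = nbuckets / buckets_per_row;

    ht->ht_rows = calloc(nrows, sizeof (hdr_t **));
    if (ht->ht_rows == NULL)
        return (-1);
    for (uint64_t r = 0; r < nrows; r++) {
        /* Each row is a separate, sqrt-sized allocation. */
        ht->ht_rows[r] = calloc(buckets_per_row, sizeof (hdr_t *));
        if (ht->ht_rows[r] == NULL)
            return (-1);    /* error cleanup omitted for brevity */
    }
    ht->ht_rowshift = __builtin_ctzll(buckets_per_row);
    ht->ht_rowmask = buckets_per_row - 1;
    ht->ht_mask = nbuckets - 1;
    return (0);
}

The net effect is that the largest single allocation is either one row
or the outer array of row pointers, both on the order of sqrt(nbuckets)
entries, instead of one nbuckets-sized slab.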

Webrev: http://cr.illumos.org/~webrev/skiselkov/3525_simplified/

Overall, the above changes boil down to the following hash table size
numbers (from small to large systems):

Physical RAM    HT Size         HT Dimensions
-------------------------------------------------
4 GB            4 MB            2048x     4k cols
128 GB          128 MB          4096x    32k cols
1 TB            1 GB            32768x  256k cols
16 TB           16 GB           262144x   2M cols

(For the actual algorithm see arc.c in the webrev.)

The performance gains from this are pretty substantial in my testing.
The time needed to rebuild 93 GB worth of ARC buffers consisting of a
mixture of 128k and 8k blocks (total number of ARC buffers: 6.4 million,
or ~15.4k average block size) decreased from ~50s to about 9.5s (with a
small portion of that being physical L2ARC read latency).

I apologize for this long litany of an email and look forward to your
comments and reviews.

Cheers,
-- 
Saso