While benchmarking the persistent L2ARC implementation, I hit a "slight" snag with smaller block sizes and rebuild times. In essence, if you have many small blocks in your L2ARC (e.g. 8k blocks due to hosting zvols), the rebuild can take forever to complete (tens of seconds). The reason has to do with the ARC hash table implementation, so first a bit of background:
The ARC holds blocks in a hash table. No surprises there. Aside from the obvious small room for improvement by marking all very hot hash table manipulation functions as "inline", there are bigger problems that need to be addressed. First of all, how does one size the hash table? Small hash tables take up less space, but result in more collisions and longer bucket chains (which hurts performance). Larger hash tables perform better, but are (doh) larger. I even went and tried switching the implementation to AVL trees (moderate performance improvement in some places, moderate performance decrease in others).

The current implementation is hardcoded guesswork as to the correct hash table size. In principle, it works by taking the amount of physical memory and dividing it by a 64k block size; the result is the number of hash buckets that will be created. On 64-bit machines this comes out to roughly 128kB of hash table for each 1 GB of physical memory. This approach has obvious flaws:

1. What if we need to cache many smaller blocks? We'll get excessively long bucket chains (see the hash_chain_max and hash_collisions kstats on your systems).

2. What if we need to cache a lot more blocks than fit into main memory? The L2ARC needs to keep its ARC headers around.

3. What happens when the hash table size is excessively large, as happens on bigmem systems? E.g. on a system with 1TB of physical memory (which can be easily built nowadays, and is only going to get cheaper) the hash table will be a contiguous chunk of memory some 128MB in size.

If we try to combat problems #1 and #2 by making the hash table larger, we exacerbate problem #3. My solutions to each of the problems (implemented in the webrev for issue #3525) are:

1. Changed the sizing algorithm to use a kernel tunable. This allows administrators with specific workloads to change the hash table size according to their needs without having to build custom kernel modules.

2. Increased the hash table size so that it runs at 1MB for every 1GB of main memory. We can easily afford to lose 0.1% of memory capacity considering the performance gains (outlined below), especially since many systems today use the L2ARC as an integral part of their storage infrastructure.

3. Switched to a 2D hash table. This means that the largest contiguous chunk of memory will be at most sqrt(hash_table_size).

Webrev: http://cr.illumos.org/~webrev/skiselkov/3525_simplified/

Overall, the above changes boil down to the following hash table size numbers (from small to large systems):

Physical RAM    HT Size    HT Dimensions
-------------------------------------------------
4 GB            4 MB       2048 x 4k cols
128 GB          128 MB     4096 x 32k cols
1 TB            1 GB       32768 x 256k cols
16 TB           16 GB      262144 x 2M cols

(For the actual algorithm see arc.c in the webrev.)

The performance gains from this are pretty substantial in my testing. The time needed to rebuild 93 GB worth of ARC buffers consisting of a mixture of 128k and 8k blocks (6.4 million ARC buffers in total, or a ~15.4k average block size) decreased from ~50s to about 9.5s (with a small portion of that being physical L2ARC read latency).

I apologize for this long litany of an email and look forward to your comments and reviews.

Cheers,
--
Saso
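
P.S. For anyone who wants a feel for the 2D layout without digging through the webrev, here is a minimal standalone userland sketch of the idea. This is NOT the code from arc.c; all names are made up, the power-of-two rows/cols split and the sizing constant are simplified for illustration (so the dimensions it prints won't exactly match the table above), and error checking is omitted:

/*
 * Userland sketch of a 2D bucket array: the table is split into
 * separately allocated rows, so the largest contiguous allocation is
 * roughly sqrt(total buckets) * sizeof (void *) instead of one huge
 * slab.  Sizing aims for ~1 MB of table per 1 GB of physical memory.
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

typedef struct ht_2d {
	uint64_t	ht_nrows;	/* separately allocated rows */
	uint64_t	ht_ncols;	/* bucket heads per row */
	void		***ht_rows;	/* ht_rows[row][col] -> chain head */
} ht_2d_t;

/* Round up to the next power of two (for cheap masking). */
static uint64_t
p2roundup(uint64_t x)
{
	uint64_t p = 1;
	while (p < x)
		p <<= 1;
	return (p);
}

/*
 * Allocate one bucket pointer per 1k of physical memory (i.e. ~1 MB of
 * table per 1 GB of RAM on 64-bit), split into a near-square grid.
 */
static ht_2d_t *
ht_2d_create(uint64_t physmem_bytes)
{
	ht_2d_t *ht = calloc(1, sizeof (*ht));
	uint64_t nbuckets = p2roundup((physmem_bytes >> 10) / sizeof (void *));
	uint64_t rows = 1;

	while (rows * rows < nbuckets)
		rows <<= 1;
	ht->ht_nrows = rows;
	ht->ht_ncols = nbuckets / rows;

	ht->ht_rows = calloc(ht->ht_nrows, sizeof (void **));
	for (uint64_t r = 0; r < ht->ht_nrows; r++)
		ht->ht_rows[r] = calloc(ht->ht_ncols, sizeof (void *));
	return (ht);
}

/* Locate the bucket chain head for a given hash value. */
static void **
ht_2d_bucket(ht_2d_t *ht, uint64_t hash)
{
	uint64_t idx = hash & (ht->ht_nrows * ht->ht_ncols - 1);
	return (&ht->ht_rows[idx / ht->ht_ncols][idx % ht->ht_ncols]);
}

int
main(void)
{
	/* Pretend we have 4 GB of physical memory. */
	ht_2d_t *ht = ht_2d_create(4ULL << 30);
	printf("%llu rows x %llu cols, largest contiguous chunk: %llu KB\n",
	    (unsigned long long)ht->ht_nrows,
	    (unsigned long long)ht->ht_ncols,
	    (unsigned long long)(ht->ht_ncols * sizeof (void *) >> 10));
	(void) ht_2d_bucket(ht, 0xdeadbeefULL);
	return (0);
}

The point of the exercise is simply that bucket lookup stays two array indexes (no extra hashing), while no single allocation grows past roughly the square root of the total table size.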
