[pkg-discuss] New layout for download cache and files directory for depo

Brock Pytlik Thu, 20 Aug 2009 16:49:01 -0700

I've spent some time over the last couple of weeks running tests on various directory layouts for the download cache and files directory on the server, and while there are still some issues to be understood, I think we have enough information to make a decision and move ahead. The plan is to use a fanout of 32768 (called 32k from now on). Experimentally, a fanout of 32k seems to perform as well as putting everything in a flat directory out to at least 7.5M files. (For reference, our dev repository currently has about 1M files in it.) 32k also keeps us within the limits imposed by other file systems, allowing us to at least start off by sharing the code path for all platforms.

I also plan to base the number of levels of directories on the total number of files being stored. For example, if fewer than 32k files are being stored, all files will go in the same directory (we'll call this one level of directories). While the number of files lies between 32k and 32k^2 (approximately 1 billion), all files will live under 2 levels of directories. In general, the number of levels will be determined by:

ceiling(log_32k(total number of files))
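As a rough illustration of the formula above, here is a minimal Python sketch. The function name directory_levels and the floor of one level for very small stores are my assumptions, not part of the proposal; integer arithmetic is used instead of a floating-point logarithm so exact powers of 32k are handled predictably.

```python
FANOUT = 32768  # the 32k fanout proposed in this message


def directory_levels(total_files, fanout=FANOUT):
    """Return ceiling(log_fanout(total_files)), with a minimum of one level.

    The minimum of one level is an assumption: the formula itself would
    give zero for a single file.
    """
    levels = 1
    capacity = fanout  # files that fit in the current number of levels
    while capacity < total_files:
        capacity *= fanout
        levels += 1
    return levels


# The 1M-file dev repository and the 7.5M-file test both need two levels.
print(directory_levels(1_000_000))
print(directory_levels(7_500_000))
```

Under this reading, a store moves to a third level only once it holds more than 32k^2 files.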

I believe this arrangement will provide good performance on OpenSolaris across clients and servers, and won't break clients (or servers) running on other platforms. Also, to facilitate a 32k fanout, we'll switch to a base 32 encoding for the files so that each directory will use 3 letters of the file name (32^3 = 32768).
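To make the base 32 idea concrete, here is a hedged sketch of how a file name might map to a path. Several details are assumptions of mine and not settled by this message: SHA-1 as the hash, the RFC 4648 base 32 alphabet from Python's base64 module, lowercasing the result, keeping the full name on the file itself, and reading "N levels" as N - 1 directory components above the file. The names encode_hash and relative_path are hypothetical.

```python
import base64
import hashlib
import os

CHARS_PER_LEVEL = 3  # 32**3 == 32768, so three characters give a 32k fanout


def encode_hash(data):
    """Return a base 32 name for data (assumes SHA-1 and the RFC 4648 alphabet)."""
    digest = hashlib.sha1(data).digest()
    # A 20-byte SHA-1 digest encodes to exactly 32 characters, so there is
    # normally no '=' padding; rstrip is a safeguard for other digest sizes.
    return base64.b32encode(digest).decode("ascii").rstrip("=").lower()


def relative_path(name, levels):
    """Build the path of name under a layout with the given number of levels.

    One level means a flat directory; each extra level adds one directory
    named after the next three characters of the file name.
    """
    parts = [
        name[i * CHARS_PER_LEVEL:(i + 1) * CHARS_PER_LEVEL]
        for i in range(levels - 1)
    ]
    parts.append(name)
    return os.path.join(*parts)
```

For a name beginning "abcdef", a two-level layout would place the file under the directory "abc", and a three-level layout under "abc/def".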

For those wondering why 32k, I plan to post a blog entry that has much more detail, and includes the charts and graphs which show some of the issues with smaller fanouts. A 32k fanout also keeps us below the 65k limit that I understand some file systems have.

Per dp's suggestion, I'm also planning on having a class (for now called CacheManager) to abstract all the handling of the cache and files directories. I'm imagining the interface will roughly be:

For retrieval: given a hash value, return a file handle for that file

For insertion: given the path to the original file, compute the hash and insert the file into the structure. We might optionally allow the caller to pass in the hash if we find we're calculating the hash multiple times.

I imagine removal would take either a file or a hash to be removed, but we don't currently support selectively removing files from either the files directory or the download cache. There will likely be a flush command to simply empty the store as well.
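The interface described above could be sketched roughly as follows. This is only an illustration: the method names (lookup, insert, remove, flush), the use of SHA-1 hex digests, copying on insert, and the flat single-directory layout are all assumptions made to keep the example short, not decisions from this proposal.

```python
import hashlib
import os
import shutil


class CacheManager:
    """Hypothetical sketch of the proposed interface (flat layout only)."""

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, hashval):
        return os.path.join(self.root, hashval)

    def lookup(self, hashval):
        """Return an open binary file handle for hashval, or None if absent."""
        try:
            return open(self._path(hashval), "rb")
        except FileNotFoundError:
            return None

    def insert(self, src_path, hashval=None):
        """Copy src_path into the store, hashing it unless hashval is given."""
        if hashval is None:
            hasher = hashlib.sha1()
            with open(src_path, "rb") as src:
                for chunk in iter(lambda: src.read(65536), b""):
                    hasher.update(chunk)
            hashval = hasher.hexdigest()
        shutil.copyfile(src_path, self._path(hashval))
        return hashval

    def remove(self, hashval):
        """Remove one stored file; a missing file is not treated as an error."""
        try:
            os.remove(self._path(hashval))
        except FileNotFoundError:
            pass

    def flush(self):
        """Empty the store (valid for the flat layout used in this sketch)."""
        for name in os.listdir(self.root):
            os.remove(os.path.join(self.root, name))
```

Letting insert accept an optional hashval matches the idea of avoiding repeated hash computation when the caller already knows the value.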

I suspect that for debugging or manual manipulation of a depot/image, we may want to create a new pkg subcommand which, given a hash or file, finds the location of the file in the directory structure.

Because we're going to change hashing algorithms in the future, and because the size of a cache or repo may change, I plan to allow the CacheManager to keep a historical list of the hash algorithms used and the layout size at that time. This will prevent a large migration hit each time either the hash algorithm changes or the structure grows beyond the current threshold. When looking for a file, the CacheManager will check each predicted structure in order, until it finds the file. It will then move the file to where the current system says it should be. There will likely also be an option to force the migration of every file should the user desire it.
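A minimal sketch of the lookup-then-migrate behaviour might look like the following. It models only changes in layout size (the number of levels), not changes of hash algorithm, and it assumes the current layout is checked first, followed by older ones. The names layout_path, find_and_migrate, and level_history are hypothetical.

```python
import os

CHARS_PER_LEVEL = 3  # three base 32 characters per directory level


def layout_path(root, name, levels):
    """Path of name under root for a layout with the given number of levels."""
    parts = [
        name[i * CHARS_PER_LEVEL:(i + 1) * CHARS_PER_LEVEL]
        for i in range(levels - 1)
    ]
    return os.path.join(root, *parts, name)


def find_and_migrate(root, name, level_history):
    """Find name in any known layout and move it into the current one.

    level_history lists level counts with the current layout first and
    older layouts after it (an assumed ordering). Returns the path in
    the current layout, or None if the file is in none of the layouts.
    """
    current = layout_path(root, name, level_history[0])
    for levels in level_history:
        candidate = layout_path(root, name, levels)
        if os.path.exists(candidate):
            if candidate != current:
                os.makedirs(os.path.dirname(current), exist_ok=True)
                os.replace(candidate, current)
            return current
    return None
```

Because each file is moved the first time it is found in an old location, the cost of a layout change is spread over later lookups rather than paid all at once.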

To handle different hash algorithms, an abbreviation indicating the algorithm used will be attached to the end of the file name.

I think that covers everything, but I'd appreciate any feedback or comments that come to mind.
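One possible shape for such names is sketched below. The "." separator and abbreviations such as "sha1" are purely illustrative assumptions; the message does not specify either. A "." is convenient here only because it cannot occur in a base 32 name.

```python
SEPARATOR = "."  # assumed separator; not part of the base 32 alphabet


def stored_name(encoded_hash, algorithm):
    """Attach a hash-algorithm abbreviation to the end of a file name."""
    return f"{encoded_hash}{SEPARATOR}{algorithm}"


def split_stored_name(name):
    """Split a stored name back into (encoded_hash, algorithm)."""
    encoded_hash, sep, algorithm = name.rpartition(SEPARATOR)
    if not sep or not encoded_hash or not algorithm:
        raise ValueError(f"no algorithm suffix in {name!r}")
    return encoded_hash, algorithm
```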


Thanks,
Brock
_______________________________________________
pkg-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/pkg-discuss
