I've spent some time over the last couple of weeks running tests on various directory layouts for the download cache and files directory on the server. While there are still some issues to be understood, I think we have enough information to make a decision and move ahead.

The plan is to use a fanout of 32768 (called 32k from now on). Experimentally, a fanout of 32k seems to perform as well as putting everything in a flat directory out to at least 7.5M files. (For reference, our dev repository currently has about 1M files in it.) 32k also keeps us within the limits imposed by other file systems, allowing us to at least start off by sharing the code path for all platforms.

I also plan to base the number of levels of directories on the total number of files being stored. For example, if fewer than 32k files are being stored, all files will go in the same directory (we'll call this one level of directories). While the number of files lies between 32k and 32k^2 (approximately 1 billion), all files will live under 2 levels of directories. In general, the number of levels will be determined by:
ceiling(log_32k(total number of files))
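
As a rough illustration of the formula above, here is a minimal Python sketch. The function name directory_levels is made up for this example, and it uses integer arithmetic instead of a floating-point log so the boundaries at exact powers of 32k come out right. It also assumes a minimum of one level, which the formula alone would not give for a single file.

```python
FANOUT = 32 ** 3  # 32768, the "32k" fanout


def directory_levels(total_files, fanout=FANOUT):
    """Return ceiling(log_fanout(total_files)), with a minimum of one level."""
    levels = 1
    capacity = fanout  # number of files that `levels` levels can hold
    while total_files > capacity:
        levels += 1
        capacity *= fanout
    return levels


for count in (1000, 1000000, 7500000, 32768 ** 2 + 1):
    print(count, directory_levels(count))
```

With these inputs, the 1M and 7.5M cases both land at 2 levels, and a third level only appears past 32k^2 files.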

I believe this arrangement will provide good performance on OpenSolaris across clients and servers, and won't break clients (or servers) running on other platforms. Also, to facilitate a 32k fanout, we'll switch to a base 32 encoding for the files so that each directory will use 3 letters of the file name.
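
To make the base 32 idea concrete, here is a hedged sketch of how a hash might map to a path. It rests on assumptions that are not settled above: the hash arrives as a hex string, the standard RFC 4648 base 32 alphabet is used, the file keeps its full encoded name, and "N levels" means N-1 subdirectories above the file. The names base32_name and relative_path are hypothetical.

```python
import base64
import hashlib
import os

CHARS_PER_LEVEL = 3  # 32 ** 3 == 32768 possible names per directory


def base32_name(hex_digest):
    """Re-encode a hex digest in base 32 (RFC 4648 alphabet, padding removed)."""
    raw = bytes.fromhex(hex_digest)
    return base64.b32encode(raw).decode("ascii").rstrip("=")


def relative_path(hex_digest, levels):
    """Build the path of a file stored under `levels` levels of directories."""
    name = base32_name(hex_digest)
    # Each subdirectory consumes the next three characters of the file name.
    prefixes = [
        name[i * CHARS_PER_LEVEL:(i + 1) * CHARS_PER_LEVEL]
        for i in range(levels - 1)
    ]
    return os.path.join(*prefixes, name)


digest = hashlib.sha1(b"example file contents").hexdigest()
print(relative_path(digest, 1))  # flat: just the file name
print(relative_path(digest, 2))  # one 3-letter directory, then the file name
```

Three base 32 characters give exactly 32^3 = 32768 possible directory names, which is where the 32k fanout comes from.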

For those wondering why 32k, I plan to post a blog entry that has much more detail and includes the charts and graphs showing some of the issues with smaller fanouts. 32k also keeps us below the 65k limit that I understand some file systems have.

Per dp's suggestion, I'm planning on also having a class (for now called CacheManager) to abstract all the handling of the cache and files directories. I'm imagining the interface will roughly be:
For retrieval: given a hash value, return a file handle for that file

For insertion: given the path to the original file, compute the hash and insert the file into the structure. We might optionally allow the caller to pass in the hash if we find we're calculating the hash multiple times.

I imagine removal would take either a file or a hash to be removed, but we don't currently support selectively removing files from either the files directory or the download cache. There will likely be a flush command to simply empty the store as well.
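
To give a feel for the interface described above, here is a rough Python sketch. Everything in it is an assumption for discussion: the method names (lookup, insert, flush), the use of SHA-1, and the flat layout, which is used only to keep the example short. Selective removal is left out since we don't support it today.

```python
import hashlib
import os
import shutil


class CacheManager:
    """Hypothetical sketch of the proposed interface, not real pkg code."""

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path_for(self, hash_value):
        # Flat layout for brevity; the fanout logic would live here.
        return os.path.join(self.root, hash_value)

    @staticmethod
    def compute_hash(path):
        """Hash a file in chunks so large files are not read into memory."""
        digest = hashlib.sha1()
        with open(path, "rb") as source:
            for chunk in iter(lambda: source.read(65536), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def lookup(self, hash_value):
        """Return an open binary file handle, or None if nothing is stored."""
        try:
            return open(self._path_for(hash_value), "rb")
        except FileNotFoundError:
            return None

    def insert(self, path, hash_value=None):
        """Copy a file into the store; the caller may supply a known hash."""
        if hash_value is None:
            hash_value = self.compute_hash(path)
        shutil.copyfile(path, self._path_for(hash_value))
        return hash_value

    def flush(self):
        """Empty the store."""
        shutil.rmtree(self.root)
        os.makedirs(self.root)
```

A caller would do something like manager.insert("/path/to/file") and later manager.lookup(hash_value), without ever needing to know how the directories are arranged.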

I suspect that, for debugging or manual manipulation of a depot or image, we may want to create a new pkg subcommand which, given a hash or file, finds the location of the file in the directory structure.

Because we're going to change hashing algorithms in the future, and because the size of a cache or repo may change, I plan to allow the CacheManager to keep a historical list of the hash algorithms used and the layout size at that time. This will prevent a large migration hit each time we change the hash algorithm or the structure grows beyond the current threshold. When looking for a file, the CacheManager will check each predicted structure in order until it finds the file. It will then move the file to where the current system says it should be. There will likely also be an option to force the migration of every file should the user desire it.

To handle different hash algorithms, an abbreviation indicating the algorithm used will be attached to the end of the file name.
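
Here is a hedged sketch of the lookup-then-migrate behaviour and the algorithm suffix described above. The Layout record, the "." separator before the abbreviation, and the function names are all assumptions made up for this example. The sketch also only migrates between layouts that share a hash algorithm, since moving a file to a layout with a different algorithm would require rehashing it, which is not shown.

```python
import os
from collections import namedtuple

# algorithm: abbreviation appended to the file name; levels: directory depth.
Layout = namedtuple("Layout", ["algorithm", "levels"])


def layout_path(root, name, layout):
    """Predict where a file with this encoded name lives under one layout."""
    prefixes = [name[i * 3:(i + 1) * 3] for i in range(layout.levels - 1)]
    return os.path.join(root, *prefixes, name + "." + layout.algorithm)


def find_and_migrate(root, name, algorithm, history):
    """Check each layout in `history` (newest first) and return the file's path.

    A file found under an older layout is moved to the newest layout that
    uses the same algorithm. Returns None if the file is not found.
    """
    candidates = [layout for layout in history if layout.algorithm == algorithm]
    if not candidates:
        return None
    target = layout_path(root, name, candidates[0])
    for layout in candidates:
        path = layout_path(root, name, layout)
        if os.path.exists(path):
            if path != target:
                os.makedirs(os.path.dirname(target), exist_ok=True)
                os.replace(path, target)
            return target
    return None
```

Because files move lazily on first access, growing past a threshold costs nothing up front, and a forced migration is just a loop that calls the same lookup for every stored file.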

I think that covers everything, but I'd appreciate any feedback or comments that come to mind.

Thanks,
Brock