Joel Rosdahl <j...@rosdahl.net> wrote:

> On 7 January 2018 at 14:02, Scott Bennett wrote:
>
> > The design problem is that there is no centralized index maintained of
> > cache entries' paths, their sizes, and their timestamps, necessitating
> > the plumbing of the directory trees. [...]
>
> Thanks for sharing your ideas!

You may wish to retract any thanks once you've read what follows. The
current independence of ccache from any other third-party software is
valued and for good reasons. However, I hope to show below a better way
to do things. That independence can still be maintained, but only at the
cost of another wheel reinvention. :-(

>
> I fully agree that the cleanup algorithm/design hasn't aged well. It has
> essentially stayed the same since Tridge created ccache in 2002, when
> storage devices were much smaller and a cache of one GB or two probably
> was considered quite large.
>
> Trying to improve the cleanup algorithm/design has not been a priority
> since I personally haven't seen such pathological behavior that you
> describe ("cleanups that can take over a half hour to run and hammer a

I don't know whether users of other operating systems are using ccache
in building their systems, but many FreeBSD users do so because the time
savings are so great. When one can cut a build time of six hours to,
say, an hour and a half, one tends to appreciate the tool(s) that
make(s) it possible. I.e., we use and love ccache because, in general,
it works so well and improves performance so much.

However, compiling an operating system means a pretty large cache area
is needed if one is to fit the working set within the cache. Similarly,
FreeBSD users who compile third-party software from the ports tree,
rather than installing it from prebuilt packages, potentially need an
even larger cache area whose size roughly depends upon the number and
size of the ports built and installed onto their systems. For example, I
currently have over 2300 ports installed, which should make clear the
reason my ports cache area is so large. Large cache areas take a long
time for the "cleanups" to run. (FWIW, I use cache_dir_levels = 5, which
may not be optimal in terms of performance. I don't have a good way of
determining the optimal depth to use for the cache directory trees.
It seems to be very, very fast for use in building things, but may well
be a killer for cleanups.)

> hard drive mercilessly"). However, I'm not at all convinced that
> introducing a centralized index is the panacea you describe.

Countless data base software implementations handle these situations
acceptably well.

>
> Do you have a sketch design of how to maintain a centralized index? Here

Well, sort of. I.e., I haven't written up a design spec or anything of
that sort, but some things seem rather obvious.

> are some requirements to consider for the design:
>
> A. It should cope with a ccache process being killed at any time.

Sure.

> B. It should work reasonably well on flaky and/or slow file systems,
> e.g. NFS.

No, not at all. Using a file system as data base software is usually a
Very Bad Idea (tm).

> C. It should not introduce lock contention for reasonable use cases.
> D. It should be quick for cache misses (not only for cleanup).
> E. It should handle cleanup quickly and gracefully.

In my view, the above are misconceived in the sense that they are
predicated upon the use of file system code as data base software.

>
> I'm guessing that you envision having one centralized lock for the
> index. The tiny stats files already suffer from lock contention in some
> scenarios because they are so few. That's why ideas like
> https://github.com/ccache/ccache/issues/168 and comments like
> https://email@example.com/msg01011.html
> (comment number 2) pop up. Even if a centralized index only needs a lock
> for writing, it would still serialize writes to the cache. I have
> trouble seeing how that would work out well. But I'll gladly be proved
> wrong.
>

Try this on for size for a moment. Imagine the software as two programs,
ccache and ccached.
ccache would contain all the code analysis and comparison (including
hashes) stuff that it currently has, but it would make a connection via
UDP or TCP to the other program, which we will call ccached, to access
the cache data base. Modern data base software packages do very well at
handling multiple, simultaneous clients, atomic commits of updates,
multiple indices, and so forth.

Now, keep in mind that this "ccached" might be a specialized program
linked to data base software or it might simply be a generic data base
server. Multiple caches (in the current sense) might be maintained as
separate data bases, either through a single server instance or as
multiple, discrete server processes, depending upon the software chosen
for the purpose, but the server(s) would be accessed by potentially many
concurrent ccache processes and could deal with consistency/integrity
issues at the cache-entry or cache-entry-element level.

Please don't ask me for a recommendation of particular data base
software because I haven't the foggiest idea. I haven't worked with a
data base package since the early 1970s, although considerably later I
did work with various software that used IBM's ISAM and that today would
be thought of as data base applications, though it was not so thought of
at the time. Back then, a data base typically involved many files and
indices, all interlinked at the record level, so an access method like
ISAM was not, by itself, sufficient to be called a data base, but it was
sometimes a component of a data base. Very often, though, people wrote
their own data base access methods or bought a commercial data base
package. A data base was a more formal affair with every field defined
in a data dictionary, etc., etc.

ccache needs nothing so complex, but you would need to consult someone
familiar with each of the "modern" types of data base software available
to decide which way to go. Very possibly you have the requisite
knowledge/experience yourself.
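
To make the idea of a central index a little more concrete, here is a
minimal sketch. It uses Python's bundled sqlite3 module purely as a
stand-in for "some data base software" and is not a recommendation of
any package. The CacheIndex class, its method names, and the table
layout are all hypothetical; none of this reflects ccache's actual code.

```python
import sqlite3
import time


class CacheIndex:
    """Hypothetical central index of cache entries: path, size, last use."""

    def __init__(self, db_path=":memory:"):
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS entries ("
            " path TEXT PRIMARY KEY,"
            " size INTEGER NOT NULL,"
            " last_used REAL NOT NULL)"
        )
        # Secondary index so "oldest first" does not need a full sort.
        self.db.execute(
            "CREATE INDEX IF NOT EXISTS by_age ON entries (last_used)"
        )

    def record(self, path, size, now=None):
        """Insert or refresh one entry inside a single transaction."""
        now = time.time() if now is None else now
        with self.db:  # commits on success, rolls back on an exception
            self.db.execute(
                "INSERT OR REPLACE INTO entries (path, size, last_used)"
                " VALUES (?, ?, ?)",
                (path, size, now),
            )

    def total_size(self):
        """Total bytes recorded, without touching the file system."""
        row = self.db.execute(
            "SELECT COALESCE(SUM(size), 0) FROM entries"
        ).fetchone()
        return row[0]

    def cleanup(self, max_size):
        """Drop least recently used entries until the total fits max_size.

        Returns the paths whose files the caller should then unlink.
        """
        evicted = []
        with self.db:
            total = self.total_size()
            rows = self.db.execute(
                "SELECT path, size FROM entries ORDER BY last_used, path"
            ).fetchall()
            for path, size in rows:
                if total <= max_size:
                    break
                self.db.execute(
                    "DELETE FROM entries WHERE path = ?", (path,)
                )
                total -= size
                evicted.append(path)
        return evicted
```

With an index of this kind, a cleanup becomes one ordered query plus
the deletions themselves, rather than a walk over every directory in
the cache. Whether a server process such as the "ccached" described
above sits in front of it is a separate design decision.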

To modify ccache to use data base software is admittedly a major
rewriting job, so I expect such an idea to put you off, but it's a
project that should ultimately yield a far superior product, IMO.

Those are my two bits' worth, and you are more than welcome to take
shots at what I've written.


                                  Scott Bennett, Comm. ASMELG, CFIAG
**********************************************************************
* Internet:   bennett at sdf.org   *xor*   bennett at freeshell.org  *
*--------------------------------------------------------------------*
* "A well regulated and disciplined militia, is at all times a good  *
* objection to the introduction of that bane of all free governments *
* -- a standing army."                                               *
*    -- Gov. John Hancock, New York Journal, 28 January 1790         *
**********************************************************************