Good points, Paulo. A few short comments...

Paulo Gaspar wrote:
> Hi Stefano and other caching maniacs,
>
> I am just going to make a brief post now, since I am overloaded at
> the moment (hence my delay in replying).
>
> This is very interesting stuff! And I like the hotspot analogy very
> much. I was also following the initial cache discussion, but I missed
> the document you attached.
>
> I am only going to focus on the major obstacle I see:
>  - sampling accuracy.
>
> In most cases (e.g. the typical web site use) we can consider the
> cost function to be related to processing time, with storage volume
> as a restriction. Of course, under this perspective we would also
> have to consider serialization time, especially when using a 2-level
> cache, but it is better to ignore that for the moment.
>
> You also mention something like this in your text:
>
>> The easiest cost function is "elapsed time": this indicates that the
>> most expensive resource is computing power and speed is the most
>> valuable property of functional operation.
>
> Then you scratch the problem:
>
>> A pure-Java implementation is possible with a time granularity of
>> milliseconds using System.currentTimeMillis(). Smaller granularities
>> require native hooks either to the system or to the JVM using JNI or
>> the JVM Native Profiling Interface. For most serving environments,
>> millisecond granularity is sufficient.
>
> Actually, I think the typical (near 10 ms) granularity of Java is
> enough, because there is another factor that affects the accuracy of
> this kind of measurement much more... and you scratch that one too:
>
>> A pure-Java implementation of such a "time + memory" cost function
>> is almost impossible due to concurrency: there is no way for a
>> particular thread to know how much memory it has used, since the JVM
>> gives information only about the system free memory and doesn't
>> allow more granularity than that.

And because of this, different caches might have to be 'weighted',
perhaps according to the type of component they hold.

> Concurrency also affects the accuracy of elapsed time measurement
> A LOT (especially on production systems), by much more than the
> above-mentioned (near 10 ms) granularity.

True. I've also noticed that even though they are fast, calls to
System.currentTimeMillis() can be about 10 times slower than e.g.
simple getters. That's why it sometimes makes sense to have a
'LazyClock' that monitors the system time by periodic calls to
System.currentTimeMillis(). This is of course just a minor
implementation detail.
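
Just to make the idea concrete, here is a rough sketch of what I mean
by a 'LazyClock' (the class and method names are only illustrative;
nothing like this exists in Cocoon as far as I know):

// Sketch only: a daemon thread refreshes a cached timestamp every
// 'periodMillis', so callers that only need coarse-grained time read
// a field instead of calling System.currentTimeMillis() every time.
public class LazyClock implements Runnable {

    private volatile long now = System.currentTimeMillis();
    private final long periodMillis;

    public LazyClock(long periodMillis) {
        this.periodMillis = periodMillis;
        Thread ticker = new Thread(this, "LazyClock");
        ticker.setDaemon(true);
        ticker.start();
    }

    /** Approximate current time, at most 'periodMillis' out of date. */
    public long currentTimeMillis() {
        return now;
    }

    public void run() {
        while (true) {
            now = System.currentTimeMillis();
            try {
                Thread.sleep(periodMillis);
            } catch (InterruptedException e) {
                return; // stop ticking if interrupted
            }
        }
    }
}

Something like new LazyClock(10) gives roughly the granularity Java
offers anyway, and for cost sampling a few milliseconds of staleness
shouldn't matter, while reading a field is much cheaper than the
native call.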
> This happens not only because of the load on the machine where the
> system is running, but also because of many other factors, like the
> load on the database server or the network you use to access the
> database and other external resources. In fact, it is impossible to
> know how much "pure" processing time it takes to obtain a resource,
> since a lot of waiting (for other threads or for resource
> availability) is always involved.
>
> The only hope one can have of getting some good sampling is through
> quantity: repeated random sampling. Otherwise the cache can end up
> "thinking" that a given document was very expensive just because it
> happened to be requested when the system was going through a load
> peak, and it might take the place of a much more "expensive" document
> which was always requested when the system was under lower loads. If
> these documents have long "cache lives", a lot of capacity can be
> lost for long periods of time due to such accidents.
>
> However, as you also mention, there is the cost of sampling. If you
> have a document "A" that is expensive in processing time, has a
> maximum cache lifetime of 24 hours and is usually requested 100 times
> a day... and you then sample how much time it takes to get it 100
> times a day, the accuracy gets better but the cost of the sampling is
> as big as the cost of not caching at all.

I'm not sure that I understand this. I think Stefano's idea was to use
normal requests for sampling. If the document is served a lot faster
from the cache, very few requests would cause it to be regenerated
(unless it's modified). The actual sampling calculations are very
cheap compared to I/O or XSLT operations.

> But usually you have families of documents that take a similar time
> to process, like:
>  - articles with 4000 words or less, without pictures, stored in XML
>    files;
>  - product records from a product catalog stored in a database;
>  - invoices with less than 10 items from a database.
>
> If your time measurements are made per family, you will usually end
> up with a much wider set of sample data and hence much more
> representative results. The use of the system will generate the
> repeated samples, and their distribution over time (across load peaks
> and low-load periods) will tend to be much more representative than
> any other mechanism we could come up with.

I understand the issue, but I'd like to see the process completely
automated. I think we might get enough information just by sampling
the time it takes to produce a resource (and perhaps the frequency of
the requests).

> Besides, your sampling data aggregated per family will take less
> storage space, hence leaving more room for caching.
> =:o)
>
> Now, you only mention one key generation method in your document:
>
>> | Result #2:                                              |
>> |                                                         |
>> | Each cacheable producer must generate the unique key of |
>> | the resource given all the environment information at   |
>> | request time                                            |
>> |                                                         |
>> |    long generateKey(Environment e);                     |
>
> Is this key per instance (the caching key) or per family/type of
> resource?
>
> The conclusion from all I wrote above is that a resource producer
> should probably produce two keys: the cache key and a "sampling
> family" key that would be used to aggregate cost sampling data.
> From your document it is not clear if this is what you propose.
>
> I would also prefer string keys, since I do not see a meaningful
> performance gain in using longs and I see much easier use with
> strings, but this is just my opinion.

I tend to agree. I once did a benchmark on this and found out that
unless custom long->Object maps are used, the key object instantiation
cost is higher than any such (barely noticeable) performance gain. As
Cocoon uses custom cache key objects (that internally use strings), I
can't really see any advantage in using longs (I might of course be
missing something).

BTW, the Colt project ("Open Source Libraries for High Performance
Scientific and Technical Computing in Java") provides several nice
implementations of various collection types (and much more). Also, the
javadocs are some of the best I've ever seen :) And here's the link:

http://tilde-hoschek.home.cern.ch/~hoschek/colt/

(: Anrie ;)
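
P.S. A rough sketch of the long vs. String key point, in case it
helps. I'm going from memory on the Colt class
(cern.colt.map.OpenLongObjectHashMap), so please check its javadocs;
everything else (the class name, the "article:4711" key) is made up
for illustration and is not Cocoon code:

import java.util.HashMap;
import java.util.Map;

import cern.colt.map.OpenLongObjectHashMap;

public class KeyLookupSketch {

    public static void main(String[] args) {
        // String keys with the standard map: the key object can be
        // reused or interned, so no wrapper is created per lookup.
        Map byString = new HashMap();
        byString.put("article:4711", "cached response");
        Object a = byString.get("article:4711");

        // long keys with the standard map: every put/get needs a new
        // Long wrapper, which is where the theoretical gain is lost.
        Map byLong = new HashMap();
        byLong.put(new Long(4711L), "cached response");
        Object b = byLong.get(new Long(4711L));

        // long keys with a primitive-keyed map such as Colt's:
        // no wrapper at all, but you are tied to long keys everywhere.
        OpenLongObjectHashMap primitive = new OpenLongObjectHashMap();
        primitive.put(4711L, "cached response");
        Object c = primitive.get(4711L);

        System.out.println(a + " / " + b + " / " + c);
    }
}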