Hi Stefano and other caching maniacs,
I am just going to make a brief post now, since I am overloaded at the moment (hence my delay in replying). This is very interesting stuff! And I like the hotspot analogy very much. I was also following the initial cache discussion but I missed the document you attached.

I am only going to focus on the major obstacle I see: sampling accuracy.

In most cases (e.g. the typical web site) we can consider the cost function to be related to processing time, with storage volume acting as a restriction. Of course, from this perspective we would also have to consider serialization time, especially when using a 2-level cache, but it is better to ignore that for the moment. You also mention something like this in your text:

> The easiest cost function is "elapsed time": this indicates that the
> most expensive resource is computing power and speed is the most
> valuable property of functional operation.

Then you scratch the surface of the problem:

> A pure-Java implementation is possible with a time granularity of
> milliseconds using System.currentTimeMillis(). Smaller granularities
> require native hooks either to the system or to the JVM using JNI or the
> JVM Native Profiling Interface. For most serving environments,
> millisecond granularity is sufficient.

Actually, I think the typical (near 10 ms) granularity of Java is enough, because there is another factor that affects the accuracy of this kind of measurement much more... and you scratch the surface of that one too:

> A pure-Java implementation of such "time + memory" cost function is
> almost impossible due to concurrency: there is no way for a particular
> thread to know how much memory it has used since the JVM gives
> information only about the system free memory and doesn't allow more
> granularity than that.

Concurrency also affects the accuracy of elapsed time measurement A LOT (especially on production systems), by much more than the above mentioned (near 10 ms) granularity. This happens not only because of the load on the machine where the system is running, but also because of many other factors, like the load on the database server or on the network you use to access the database and other external resources. In fact, it is impossible to know how much "pure" processing time it takes to obtain a resource, since a lot of waiting (for other threads or for resource availability) is always involved.

The only hope one can have of getting good sampling is through quantity: repeated random sampling. Otherwise the cache can end up "thinking" that a given document was very expensive just because it happened to be requested while the system was going through a load peak, and it might take the place of a much more "expensive" document which was always requested when the system was under lower loads. If these documents have long "cache lives", a lot of capacity can be lost for long periods of time due to such accidents.

However, as you also mention, there is the cost of sampling. If you have a document "A" that is expensive in processing time, with a maximum cache lifetime of 24 hours, that is usually requested 100 times a day... and you then sample how much time it takes to generate it 100 times a day, the accuracy gets better but the cost of the sampling is as big as the cost of not caching at all.

But usually you have families of documents that take a similar time to process, like:
 - Articles with 4000 words or less, without pictures, stored in XML files;
 - Product records from a product catalog stored in a database;
 - Invoices with less than 10 items from a database.

If your time measurements are made per family, you will usually end up with a much wider set of sample data and hence much more representative results. Normal use of the system will generate the repeated samples, and their distribution over time (across load peaks and low load periods) will tend to be much more representative than anything any other mechanism we could come up with would give. Besides, your sampling data aggregated per family will take less storage space, hence leaving more room for caching. =:o)
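Just to make the idea more concrete, here is a minimal sketch of the kind of per-family cost sampling I have in mind. All class, method and family names are made up for illustration, of course, nothing from your document:

    import java.util.HashMap;
    import java.util.Map;

    /**
     * Sketch only: aggregates elapsed-time samples per "family" of
     * resources instead of per resource instance, so every request in
     * the family contributes to one shared (and therefore much better
     * populated) estimate.
     */
    public class FamilyCostSampler {

        /** Running totals for one family. */
        private static class Stats {
            long totalMillis;
            long samples;
        }

        private final Map<String, Stats> families = new HashMap<String, Stats>();

        /** Record how long it took to produce one resource of this family. */
        public synchronized void addSample(String familyKey, long elapsedMillis) {
            Stats s = families.get(familyKey);
            if (s == null) {
                s = new Stats();
                families.put(familyKey, s);
            }
            s.totalMillis += elapsedMillis;
            s.samples++;
        }

        /** Average production cost for a family, or -1 if nothing was sampled yet. */
        public synchronized long estimatedCost(String familyKey) {
            Stats s = families.get(familyKey);
            return (s == null || s.samples == 0) ? -1 : s.totalMillis / s.samples;
        }
    }

A producer would then do something like:

    FamilyCostSampler sampler = new FamilyCostSampler();
    long start = System.currentTimeMillis();
    // ... generate the resource ...
    sampler.addSample("article-xml-no-pictures", System.currentTimeMillis() - start);

and the cache would ask estimatedCost("article-xml-no-pictures") when deciding what to keep and what to evict.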
Now, you only mention one key generation method in your document:

> | Result #2:                                              |
> |                                                         |
> | Each cacheable producer must generate the unique key of |
> | the resource given all the environment information at   |
> | request time                                            |
> |                                                         |
> |    long generateKey(Environment e);                     |

Is this key per instance (the caching key) or per family/type of resource? The conclusion from all I wrote above is that a resource producer should probably produce two keys: the cache key and a "sampling family" key that would be used to aggregate cost sampling data. From your document it is not clear whether this is what you propose.

I would also prefer string keys, since I do not see a meaningful performance gain in using longs and strings are much easier to work with, but this is just my opinion.
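To make that more concrete, here is a rough sketch of what I mean. The interface and method names are made up (only the Environment parameter is taken from your generateKey() signature), so take it as an illustration rather than a proposal for the actual API:

    /**
     * Sketch only: a cacheable producer hands out two string keys, one
     * identifying the exact cache entry and one identifying the
     * "sampling family" under which its cost samples are aggregated.
     */
    public interface CacheableProducer {

        /** Unique key of the produced resource, e.g. "article/1234?lang=en". */
        String getCacheKey(Environment e);

        /** Key of the cost-sampling family, e.g. "article-xml-no-pictures". */
        String getSamplingFamilyKey(Environment e);
    }

This way the cache can store and look up entries by the cache key, while all the cost samples of, say, small invoices end up aggregated under one family key.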
Thanks for the great food for thought you always provide.

Have fun,
Paulo Gaspar

http://www.krankikom.de
http://www.ruhronline.de


> -----Original Message-----
> From: Stefano Mazzocchi [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, December 12, 2001 7:17 PM
>
>
> Talking about placing too many irons in the fire :)
>
> Paulo Gaspar wrote:
>
> > The "adaptive caching" idea just arrived too early but I hope it is not
> > forgotten.
>
> You can bet your ass it's not :) [excuse my oxford english]
>
> The concept I proposed (more than 6 months ago) about adaptive caching
> is difficult to understand because it's very different from the usual
> caching approach.
>
> Let me try to explain it:
>
> When a request arrives for a resource (which might also be a pipeline
> fragment), you have two choices: ask the cache if the resource is still
> valid, or go right ahead and regenerate it.
>
> And then, after it's regenerated, should I save it for further use? If
> so, where? Disk or memory? And if the memory is full, what should I
> throw away?
>
> There is a very simple answer for all of this: do what gives you the
> best performance.
>
> Period.
>
> The real problem is: once I've taken a decision (cache or don't cache),
> how do I know what would have happened performance-wise if I had taken
> the other choice?
>
> This is the key problem.
>
> I propose a simple solution: add 'selection noise' and use incoming
> requests as sampling events on the system to test.
>
> It's a trick: the objective is to reduce the time it takes to handle
> those requests, but I use them to obtain time information on the system
> and I superimpose a stochastic sampling on the decision-making process.
>
> The problem is that sampling uses user requests, so we must reduce the
> 'noisy' behavior of these requests: my solution is to make this
> 'selection noise' a function of the difference between the two paths.
> So, if going through the cache system is, on average, 3 times as fast, I
> use 1 request over 100. Otherwise, if the cache yields a 20% improvement
> (1.2 times as fast), I sample 1 over 5, and so on.
>
> This guarantees that:
>
> 1) users don't perceive this, since the faster one path is, the less
> frequently the other path is sampled.
>
> 2) the system remains stable and adaptive: if sampling the other path
> reduces the difference, the frequency of sampling increases, thus
> ensuring a way to 'steer' the decision making following the actual
> system performance.
>
> Sampling sacrifices a little peak performance for adaptability, but
> allows the cache system to transparently adapt even to those cases where
> caching is "slower" than regenerating the resource without storing it
> in the cache system (storing also has a cost).
>
> Another almost magic effect of this system is that we have a direct
> measure of the efficiency of the cache: assuming time-locality of
> actions, I have a way to measure directly the amount of CPU time the
> cache system saved.
>
> How so?
>
> Every time a request comes in, I have the information on the 'estimated'
> time the resource will need to be generated on the different routes.
> Once the decision is taken, I have the information on how much it
> actually took and I can compare it with the "assumed" time it would have
> taken on the other path. Then I know how much time I *saved* with this
> choice.
>
> But we don't stop here: we also have a way to measure the efficiency of
> the cache itself between cache hits and cache misses.
>
> This information might be used to estimate the RAM a server would
> require to obtain near-maximum efficiency.
>
> And all this without even touching a single configuration!!!
>
> A wonderful example of what adaptivity adds to caching is that you get
> immediate feedback (with numbers!!!) on how much a change, say in your
> database monitoring code, influences caching efficiency.
>
> This is a completely innovative approach: today the decision whether or
> not to cache something is estimated a priori by system administrators,
> but in complex systems very few people can make this decision and tune
> it for every change in the system.
>
> Consider this as a hotspot caching machine.
>
> Find attached my original message (WARNING: it's more than 20 printed
> pages!), maybe this time more people will understand it. :)
>
> --
> Stefano Mazzocchi      One must still have chaos in oneself to be
>                        able to give birth to a dancing star.
> <[EMAIL PROTECTED]>                         Friedrich Nietzsche
> --------------------------------------------------------------------

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, email: [EMAIL PROTECTED]