On Wednesday, Jul 16, 2003, at 07:38 America/Guayaquil, Berin Loritsch wrote:


My
statements about the limited value (cost/benefit ratio) of partial pipeline
caching have to do with *my* experience. Maybe others have had different
experiences, but all my dynamic information was always encapsulated in
one place: the generator. The transformers had an infinite ergodic
period (infinite in that they did not change until there was a development
reason to do so). The serializers were quite simple.

The CIncludeTransformer has an undefinable ergodic period, since it depends on the aggregated resources and these are known only at runtime.
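To make this concrete, here is a tiny sketch in Java (made-up names, not the actual Cocoon classes) of why that validity can only be assembled at request time:

    import java.util.ArrayList;
    import java.util.List;

    // Made-up names, not the actual Cocoon classes: the point is only that
    // the transformer's validity is assembled at request time, from whatever
    // sources it happens to include.
    interface Validity {
        boolean isStillValid();
    }

    class AggregateValidity implements Validity {
        private final List<Validity> includedSources =
                new ArrayList<Validity>();

        void add(Validity sourceValidity) {
            includedSources.add(sourceValidity);
        }

        // Valid only as long as every included source is still valid.
        public boolean isStillValid() {
            for (Validity v : includedSources) {
                if (!v.isStillValid()) {
                    return false;
                }
            }
            return true;
        }
    }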


It is true that, for caching reasons, we tried to push people to move everything that influences caching into the generation stage (and to avoid giving serializers access to the environment), but this is not always possible, as the CIncludeTransformer shows.

Sure, it is possible to move the cinclude behavior into the generator, but then the whole benefit of pipelining goes down the drain.

We must be more general than what you suggest, or the cache will only adapt to your specific cases.

With that combination of facts, the only thing in my pipeline with an
ergodic period less than infinity was the generator.  For my static
pages, I could have simply compiled the results and served them.  For
my dynamic pages, the generation was pretty darn quick.

Then again, maybe my *experience* is not typical, which means what is
good enough for me is not good enough for others.

Exactly, but since the caching is more general and adapts to the optimal solution by itself, what is good enough for the general concept should be good enough for the specific one.


The basic crux of programming Intelligent Agents is that agents react
to the state of the environment, discover the best response, and act
(usually in a way that affects the environment).

What you outlined was the "search algorithm" (the discovery method),
and the manner in which the agent (cache controller) acted upon the
environment.

What you left as an exercise to the reader was the Cost function.

See my email to Marc for more insights.


The
rules-based approach is one way of achieving that cost function declaratively.

True. But I prefer a purely statistical approach because I'm lazy and I don't want to write those rules: I want the system to find them out for me. ;-)


Any time you incorporate conditional logic in your cost function, you
have explicit rules.

True. But I showed that you don't need to: costs are always calculated the same way (which is, BTW, how cost is defined economically).


Any time you have a weighted cost that factors in
different elements you have implicit rules.

Again true, and I concur that finding those weights can be tricky. But, IMO, choosing three numbers (in the case of the time/memory/disk function) is much easier than finding out each and every rule.
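To make that concrete, this is roughly what "choosing three numbers" amounts to (invented names, not code from the whitepaper):

    // Invented names and numbers, not code from the whitepaper: the whole
    // "configuration" reduces to three weights, one per resource.
    class ResourceCost {
        private final double timeWeight;    // cost of one millisecond
        private final double memoryWeight;  // cost of one byte of RAM
        private final double diskWeight;    // cost of one byte on disk

        ResourceCost(double timeWeight, double memoryWeight,
                     double diskWeight) {
            this.timeWeight = timeWeight;
            this.memoryWeight = memoryWeight;
            this.diskWeight = diskWeight;
        }

        // The same formula for every cached resource: no per-resource rules.
        double cost(double millis, double ramBytes, double diskBytes) {
            return timeWeight * millis
                 + memoryWeight * ramBytes
                 + diskWeight * diskBytes;
        }
    }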


Explicit rules are easier to debug, understand, and predict.

Here I disagree.


An example from spam filtering might give you an insight into what I mean. My bogofilter found out that emails containing #FF0000 were likely to be spam. This is because #FF0000 is the HTML code for the color red, and spam uses red to catch attention (since then, spammers have understood that in order to get through, spam needs to look like regular email, so they avoid it and this rule is much less effective).

My point is: I don't know about you, but I would never have come up with a rule such as "if it contains #FF0000, then increase the spamicity" (spamicity, for bogofilter, is the measure of how likely an email is to be spam; it is a real number between 0.0 and 1.0, where 1.0 is definitely spam and 0.0 is definitely ham).
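For the record, this is roughly how a bogofilter-like filter folds the per-token probabilities (such as the one it learned for #FF0000) into a single spamicity. It is a simplified Graham-style combination, not bogofilter's exact algorithm:

    class Spamicity {

        // Graham-style combination (simplified): fold the learned per-token
        // probabilities into one spamicity between 0.0 (ham) and 1.0 (spam).
        static double combine(double[] tokenProbabilities) {
            double spamProduct = 1.0;
            double hamProduct = 1.0;
            for (double p : tokenProbabilities) {
                spamProduct *= p;
                hamProduct *= (1.0 - p);
            }
            return spamProduct / (spamProduct + hamProduct);
        }

        public static void main(String[] args) {
            // A token like #FF0000 might have a learned probability of 0.95.
            double[] probs = { 0.95, 0.60, 0.20 };
            System.out.println(combine(probs)); // prints ~0.877: likely spam
        }
    }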

Moving this into the realm of caching, as I previously stated, it might be the case that the cache is efficient between 11:30AM and 2:00PM but never during the rest of the day. How are you going to find that rule by yourself? Besides, suppose that your web site attracts more people from other timezones: you would have to change the rules accordingly.
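A purely statistical cache, on the other hand, can discover such a pattern by itself, just by keeping per-period counters. A hypothetical sketch, not the proposed design:

    // Hypothetical sketch, not the proposed design: per-hour hit/miss
    // counters are enough for the controller to notice by itself that
    // caching only pays off around lunch time, in whatever timezone the
    // traffic actually comes from.
    class HourlyCacheStats {
        private final long[] hits = new long[24];
        private final long[] misses = new long[24];

        void record(int hourOfDay, boolean hit) {
            if (hit) {
                hits[hourOfDay]++;
            } else {
                misses[hourOfDay]++;
            }
        }

        // Observed hit ratio for a given hour; 0.0 if nothing recorded yet.
        double hitRatio(int hourOfDay) {
            long total = hits[hourOfDay] + misses[hourOfDay];
            return total == 0 ? 0.0 : (double) hits[hourOfDay] / total;
        }
    }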

I can make tons of examples of why an adaptive system is much better than a rule engine and, in general, can perform just as well or better, given a reasonable amount of statistical information.

There are different ways of applying
explicit rules without resorting to if/then/else or switch/case functions.
In fact you might be able to come up with a way to translate explicit rules
into a function with implicit rules.

This is exactly what I did.


Truth be told, your search algorithm might be as efficient as they come.
It might not.

True. It has to be demonstrated. But the incredible ability of bogofilter gives me hope that this is the case for caching as well.


But do keep in mind the poor administrator/installer. The guy who is
managing the install is more interested in the efficiency of the cache
and how Cocoon uses resources than the developer. The user in many cases
won't know the difference because their network connection is throttling
the response, or their browser is the bottleneck. The efficiency of the
cache will likely have the most impact on the scalability of the server.
That is my OPINION, and should not be taken as gospel. Please do not
shoot me for holding that opinion.

Not at all. When implemented, the dynamic adaptability will be configurable and you can turn it off or plug in your own cache if you dislike it.


Still, keeping the poor administrator/installer in mind, I don't see how a rule engine can be easier to configure than a system that doesn't need anything at all, learning, adapting, and optimizing resources as it works, by itself and with no need for supervision.

I took the 18 pages home and started to play with some code. I appreciate
the journey to get to the search algorithm you proposed. The functions
as declared in the beginning were not that efficient, and on my machine,
as soon as you had more than ~410 samples for the invalid/valid cost
evaluations, the function took ~10ms (the granularity of my clock) to
evaluate. The evaluation used constants for all the cost figures because
that would be the computationally cheapest way to evaluate the efficiency
of those functions.


What came out of this exercise, to my dismay, was a better way of evaluating a
continual cost value.  That way, instead of having a sample for each
request at a particular time, we only had to work with three samples:
a vastly improved search algorithm in terms of being computationally
cheap.

Hmmm, I'm very interested in this, can you elaborate more?
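My guess, and it is only a guess, is that you mean keeping a handful of running aggregates instead of one cost sample per request. Something along these lines would indeed be O(1) per request:

    class RunningCost {
        private long count = 0;
        private double sum = 0.0;
        private double sumOfSquares = 0.0;

        // O(1) per request: no list of samples to walk through.
        void add(double cost) {
            count++;
            sum += cost;
            sumOfSquares += cost * cost;
        }

        double mean() {
            return count == 0 ? 0.0 : sum / count;
        }

        double variance() {
            if (count == 0) {
                return 0.0;
            }
            double m = mean();
            return sumOfSquares / count - m * m;
        }
    }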


Before it was unclear what type would best represent the cost function,
but when you introduced the trigonometric functions into the mix, it was
clear that floating point precision was required.

I think that's the case, and today's CPUs don't make it slower to use floating-point math. Still, the hyperbolic tangent function was just one solution: there is at least one other family of simple polynomial functions that exhibits the same overall properties.
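For example, something as simple as a clamped cubic already has the same overall shape (bounded, monotone, saturating at the extremes), as a quick comparison shows:

    class SaturatingCurves {

        static double tanhCurve(double x) {
            return Math.tanh(x);
        }

        // Illustration only, one possible alternative: clamp to [-1, 1],
        // then apply 1.5*x - 0.5*x^3. Bounded, monotone on the clamped
        // range, saturating at +/-1, and needs nothing but multiplications.
        static double cubicCurve(double x) {
            double s = Math.max(-1.0, Math.min(1.0, x));
            return 1.5 * s - 0.5 * s * s * s;
        }

        public static void main(String[] args) {
            for (double x = -3.0; x <= 3.0; x += 1.0) {
                System.out.printf("x=%4.1f  tanh=%6.3f  cubic=%6.3f%n",
                        x, tanhCurve(x), cubicCurve(x));
            }
        }
    }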

Perhaps seeing what you want in code would be the best thing. It would
help solidify things for those of us who don't do the math or are overwhelmed by
a very long whitepaper and are trying to derive from it how it will practically
make our lives better. It would also help determine the cost/benefit ratio
of actually developing this thing. As it stands, any idea that takes 18
pages to explain gives the impression of a very high cost. Whether this
bears out in practice or not remains to be seen.

I can't tell you when, but I'll do this.


In the end, it might all be an academic exercise that doesn't work IRL, I don't know. But in case it does, it would be a very different system that would provide Cocoon with even more advantages over other solutions (or, at least, that's my goal).

Thanks for taking the time to investigate this further. Very much appreciated.

--
Stefano.


