Answer inline:

> -----Original Message-----
> From: Stefano Mazzocchi [mailto:[EMAIL PROTECTED]]
> Sent: Sunday, December 16, 2001 1:15 AM
>
> Paulo Gaspar wrote:
> > ...
> >
> > Otherwise the cache can end up "thinking" that a given document was
> > very expensive just because it happened to be requested when the
> > system was going through a load peak, and it might take the place
> > of a much more "expensive" document which was always requested when
> > the system was under lower loads. If these documents have long
> > "cache lives", a lot of capacity can be lost for long periods of
> > time due to such accidents.
>
> True, but the user doesn't give a damn about "what" influenced the
> slowness of the document, neither does the cache: it's the actual
> result which is sampled, so maybe the document could take 10ms to
> generate without load (and the cache might take 12ms) but under load
> it takes 25ms and the cache 23ms.
>
> ...
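The comparison Stefano describes can be sketched in a few lines of Java. This is only an illustration of the idea, not Cocoon's actual API; the class and method names are made up here:

```java
// Sketch of the decision described above: the cache compares the
// *actual* sampled serve times, whatever caused them (load included),
// and picks whichever path is currently cheaper.
public class CacheChoice {

    /** @return true if serving from the cache is the cheaper choice. */
    static boolean serveFromCache(double sampledProductionMs,
                                  double sampledCacheMs) {
        return sampledCacheMs < sampledProductionMs;
    }

    public static void main(String[] args) {
        // Without load: production 10ms, cache 12ms -> produce directly.
        System.out.println(serveFromCache(10, 12)); // false
        // Under load: production 25ms, cache 23ms -> serve from cache.
        System.out.println(serveFromCache(25, 23)); // true
    }
}
```

The point of the sketch is that nothing in the decision cares *why* a sample was slow; only the measured totals enter the comparison.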
All you say is true. The problem is with what I wrote - I did not get
my message through.

What I mean is that the cache could accumulate quite a few wrong cost
measurements for long periods of time if they happen with resources
that have a long "cache life".

Let's say you have resources A and B and that, under low load, A takes
100 ms to build and B takes 15 seconds. Let's also say that both have
a cache life of 24 hours. If, when A is first requested, the system is
under heavy load, it might end up taking 30 seconds to get (sh*t like
this happens with database resources). Half an hour later B is
requested for the first time and the system is already under low load,
so it takes the usual 15 seconds. For the next (almost) 24 hours, the
cache will think that A is more expensive than B.

What I am saying is that this kind of error can happen a lot. Its cost
will vary with the characteristics of the system (e.g. longer cache
lives => higher cost).

> > However, as you also mention, there is the cost of sampling. If
> > you have a processing-time-expensive document "A" with a maximum
> > cache lifetime of 24 hours that is usually requested 100 times a
> > day... and then you sample how much time it takes to get it 100
> > times a day, the accuracy gets better but the cost of the sampling
> > is as big as the cost of not caching at all.
>
> Yes, but the frequency of sampling a resource is inversely
> proportional to the difference between the costs of the two choices.
>
> ...

Again, all you say is true. The problem is with what I wrote - I did
not get my message through. Really, what I wrote is open to different
interpretations.

I was talking about the silly possibility of gathering more sample
data just by not using stuff that is already cached.
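One common way to keep a single load-peak sample (like A's 30 seconds) from dominating an estimate for a whole cache life is to smooth samples with an exponential moving average. The sketch below is my illustration of that general technique, with invented names; the thread does not propose this specific mechanism:

```java
// Hypothetical sketch (not Cocoon's API): smoothing cost samples with
// an exponential moving average so that one measurement taken during a
// load peak cannot dominate the cost estimate for a whole "cache life".
import java.util.HashMap;
import java.util.Map;

public class CostEstimator {
    // Weight of each new sample; smaller means more smoothing.
    private static final double ALPHA = 0.2;
    private final Map<String, Double> costs = new HashMap<>();

    /** Record a new production-time sample (in ms) for a resource. */
    public void sample(String key, double millis) {
        Double old = costs.get(key);
        costs.put(key, old == null ? millis
                                   : old + ALPHA * (millis - old));
    }

    /** Current smoothed cost estimate, or 0 if never sampled. */
    public double estimate(String key) {
        Double c = costs.get(key);
        return c == null ? 0.0 : c;
    }

    public static void main(String[] args) {
        CostEstimator e = new CostEstimator();
        e.sample("A", 30000);        // first sample taken during a load peak
        e.sample("B", 15000);        // B's normal cost
        for (int i = 0; i < 20; i++) {
            e.sample("A", 100);      // A's normal cost, sampled repeatedly
        }
        // After a few normal samples, A's estimate drops well below B's.
        System.out.println(e.estimate("A") < e.estimate("B")); // true
    }
}
```

The catch, of course, is that smoothing only helps once there *are* more samples; it does nothing for a resource sampled once and then served from cache for 24 hours, which is exactly the scenario above.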
In this silly scenario, the system would try to get more data about
how expensive a resource with a long "cache life" is by, for "some"
requests "along the day", fetching the resource again from its origin
and ignoring its cached version.

> > But usually you have families of documents that take a similar
> > time to process, like:
> > - Articles with 4000 words or less, without pictures, stored in
> >   XML files;
> > - Product records from a product catalog stored in a database;
> > - Invoices with less than 10 items from a database.
> >
> > If your time measurements are made per family, you will usually
> > end up with a much wider set of sample data and hence much more
> > representative results.
>
> How do you envision the cache estimating what a 'family of
> resources' is?

You have to tell it, as follows later in the text.

> > Use of the system will generate the repeated samples, and their
> > distribution along the time (and along load peaks and low load
> > periods) will tend to be much more representative than any other
> > mechanism we could come up with.
> >
> > Besides, your sampling data aggregated per family will take less
> > storage space, hence leaving more room for caching.
> > =:o)
>
> Good point.

It is the same thing you (better) described before in your reply with:

  True, but the user doesn't give a damn about "what" influenced the
  slowness of the document, neither does the cache: it's the actual
  result which is sampled, so maybe the document could take 10ms to
  generate without load (and the cache might take 12ms) but under load
  it takes 25ms and the cache 23ms.

  This shows that the cost function is not CPU time or 'single
  document processing time', but it's a more global 'production time
  for that resource at this very time' and *MUST* include everything,
  even your load peaks.
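The per-family aggregation idea can be sketched like this. The names (`FamilyStats`, the `"invoice-detail"` key) are hypothetical, but the mechanism matches the text: cost samples pool under a coarse family key while cached entries would keep their fine-grained instance keys, so many instances contribute to one estimate:

```java
// Illustrative only - the thread does not define these interfaces.
// Cost samples are aggregated under a coarse "family" key (e.g.
// "invoice-detail") instead of per cached instance, so normal use of
// the system yields many samples per estimate and outliers average out.
import java.util.HashMap;
import java.util.Map;

public class FamilyStats {
    // Per family: {sample count, total production time in ms}.
    private final Map<String, long[]> stats = new HashMap<>();

    public void sample(String familyKey, long millis) {
        long[] s = stats.computeIfAbsent(familyKey, k -> new long[2]);
        s[0]++;
        s[1] += millis;
    }

    /** Average production time for the whole family, in ms. */
    public double averageCost(String familyKey) {
        long[] s = stats.get(familyKey);
        return s == null ? 0.0 : (double) s[1] / s[0];
    }

    public static void main(String[] args) {
        FamilyStats fs = new FamilyStats();
        // Three different invoices, one family: the samples pool together.
        fs.sample("invoice-detail", 90);
        fs.sample("invoice-detail", 110);
        fs.sample("invoice-detail", 100);
        System.out.println(fs.averageCost("invoice-detail")); // 100.0
    }
}
```

This also shows the storage point from the quote: one `long[2]` per family instead of sample data per cached instance.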
> > Now, you only mention one key generation method in your document:
> >
> > > | Result #2:                                              |
> > > |                                                         |
> > > | Each cacheable producer must generate the unique key of |
> > > | the resource given all the environment information at   |
> > > | request time                                            |
> > > |                                                         |
> > > | long generateKey(Enviornment e);                        |
> >
> > Is this key per instance (the caching key) or per family/type of
> > resource?
>
> Per resource.

Maybe we have a vocabulary problem here: I think I do not understand
what a resource is in Cocoon. =:o/

Let's take my previous example:

> > - Invoices with less than 10 items from a database.

What is a resource? Is it the "invoice detail view" or is it the
"detail view of invoice number 5678665"?

> > The conclusion from all I wrote above is that a resource producer
> > should probably produce two keys: the cache key and a "sampling
> > family" key that would be used to aggregate cost sampling data.
> > From your document it is not clear if this is what you propose.
>
> I can't think of a way to come up with 'resource families', but I'm
> open to suggestions.

Not sure; first I need to understand what "resource" really means!
=:o)

> > I would also prefer string keys since I do not see a meaningful
> > performance gain on using longs, and I see much easier use with
> > strings, but this is just my opinion.
>
> Well, I do: new String() is the single most expensive operation in
> the java.lang package.
>
> The Java golden rule for performance is: don't use strings if you
> can avoid it.

OTOH, if each request takes 10000 (ten thousand) times the time of
creating a String, and if creating the string really makes things much
easier...

Have fun,
Paulo Gaspar