> I'm interested to know what the opinions are of those on this list with
> regards to caching objects during database write operations. I've
> encountered different views and I'm not really sure what the best
> approach is.

I described some of my views on this in the article on the eToys design,
which is archived at perl.com.

> Take a typical caching scenario: Data/objects are locally stored upon
> loading from a database to improve performance for subsequent requests.
> But when those objects change, what's the best method for refreshing the
> cache? There are two possible approaches (maybe more?):
>
> 1) The old cache entry is overwritten with the new.
> 2) The old cache entry is expired, thus forcing a database hit (and
> subsequent cache load) on the next request.
>
> The first approach would tend to yield better performance. However
> there's no guarantee the data will ever be read. The cache could end up
> with a large amount of data that's never referenced. The second approach
> would probably allow for a smaller cache by ensuring that data is only
> cached on reads.
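
To make the two options concrete, here's a minimal sketch (in Python for
brevity; the dicts and the db_read/db_write names are just stand-ins for a
real shared cache and database layer, not anyone's actual API):

```python
db = {}     # stand-in for the database
cache = {}  # stand-in for the shared cache

def db_write(key, value):
    db[key] = value

def db_read(key):
    return db[key]

def update_write_through(key, value):
    """Option 1: overwrite the cache entry along with the database."""
    db_write(key, value)
    cache[key] = value  # may cache data that is never read again

def update_invalidate(key, value):
    """Option 2: expire the entry; the next read repopulates the cache."""
    db_write(key, value)
    cache.pop(key, None)  # forces a database hit on the next request

def read(key):
    if key not in cache:
        cache[key] = db_read(key)  # cache load on demand
    return cache[key]
```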

There are actually thousands of variations on caching.  In this case you
seem to be asking about one specific aspect: what to cache.  Another
important question is how to ensure cache consistency.  The approach you
choose depends on frequency of updates, single server vs. cluster, etc.

There's a simple answer for what to cache: as much as you can, until you hit
some kind of limit or performance is good enough.  Sooner or later you will
hit the point where the tradeoff in storage or in time spent ensuring cache
consistency will force you to limit your cache.

People usually use something like a dbm or Cache::Cache to implement
mod_perl caches, since then you get to share the cache between processes.
Storing the cache on disk means your storage is nearly unlimited, so we'll
ignore that aspect for now.  There's a lot of academic research on
deciding what to cache in web proxy servers given a limited amount of
space, which is worth a look if you do have space limitations: lots of
stuff on LRU, LFU, and other popular cache expiration algorithms.
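
If you do hit a space limit, the classic algorithms are simple enough to
experiment with directly.  Here's a tiny LRU sketch (Python, with names I
made up; real caches add locking, serialization, etc.):

```python
from collections import OrderedDict

class LRUCache:
    """Least-recently-used eviction: when the cache is full, drop the
    entry that has gone longest without being read or written."""

    def __init__(self, max_items):
        self.max_items = max_items
        self.items = OrderedDict()  # oldest entry first

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # mark as recently used
        return self.items[key]

    def set(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.max_items:
            self.items.popitem(last=False)  # evict the oldest entry
```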

The limit you are more likely to hit is that it will start to take too long
to populate the cache with everything.  Here's an example from eToys:

We used to generate most of the site as static files by grinding through all
the products in the database and running the data through a templating
system.  This is a form of caching, and it gave great performance.  One day
we had to add a large number of products that more than doubled the size of
our database.  The time to generate everything became prohibitive: our
content editors wanted updates to appear within a certain number of
hours, but regenerating all the static files took longer than that.

To fix this, we moved to not generating anything until it was requested.  We
would fetch the data the first time it was asked for, and then cache it for
future requests.  (I think this corresponds to your option 2.)  Of course
then you have to decide on a cache consistency approach for keeping that
data fresh.  We used a simple TTL approach because it was fast and easy to
implement ("good enough").

This is just scratching the surface of caching.  If you want to learn more,
I would suggest some introductory reading.  You can find lots of general
ideas about caching by searching Google for things like "cache consistency."
There are also a couple of good articles on the subject that I've read
recently.  Randal has an article that shows an implementation of what I
usually call "lazy reloading":
http://www.linux-mag.com/2001-01/perl_01.html

There's one about cache consistency on O'Reilly's onjava.com, but all the
examples are in Java:
http://www.onjava.com/pub/a/onjava/2002/01/09/dataexp1.html

Also, in reference to Rob Nagler's post, it's obviously better to be in a
position where you don't need to cache to improve performance.  Caching adds
a lot of complexity and causes problems that are hard to explain to
non-technical people.  However, for many of us caching is a necessity for
decent performance.

- Perrin
