On 5/19/07, Will Fould <[EMAIL PROTECTED]> wrote:
I'm afraid that:
   1. hashes get really big (greater than a few MBs each)
   2. re-caching an entire hash just because one key updated (wasteful)
   3. latency for pulling cache data from a remote DB
   4. doing this for all children

The most common way to improve speed is to cache things after you
fetch them from the db, rather than pre-fetching as you are now.  You
give them a reasonable timeout value, and always check the cache for
data first, falling back to the db if it's not there.  For
applications that can tolerate a little stale data and have a
relatively small set of hot data, this works great.  It also assumes
that you can make your code fetch from the db (when the result is not
cached yet) in a slow but reasonable amount of time.
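That fetch-on-miss pattern can be sketched as follows (Python for illustration; the cache here is a plain in-process dict and the names are hypothetical -- in a mod_perl app the same logic would sit in front of whatever cache backend you choose):

```python
import time

CACHE_TTL = 300  # seconds of staleness the application can tolerate
_cache = {}      # key -> (value, fetched_at)

def get(key, fetch_from_db):
    """Check the cache first; fall back to the db on a miss or expired entry."""
    entry = _cache.get(key)
    if entry is not None:
        value, fetched_at = entry
        if time.time() - fetched_at < CACHE_TTL:
            return value  # fresh cache hit
    # Miss or expired: hit the database (slow but acceptable), then cache it.
    value = fetch_from_db(key)
    _cache[key] = (value, time.time())
    return value
```

The timeout is what keeps stale data bounded: hot keys stay cached, cold keys expire and get refetched on next use.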

If you want to stick with pre-fetching, you have a few options.  One
is to use memcached.  It will be much slower than your current method.
However, you can update values whenever you like and they will be
visible to all processes on all servers immediately.  You can't count
on data to be there though -- you have to structure your application
so it can fetch from the db if memcached drops some data.  It is not a
database.
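The key structural point -- never trust the cache to hold a key -- looks like this in sketch form (Python for illustration; FakeMemcached is a stand-in with the get/set shape common memcached clients use, not a real client):

```python
class FakeMemcached:
    """Stand-in for a real memcached client; any key may vanish at any time."""
    def __init__(self):
        self._store = {}
    def get(self, key):
        return self._store.get(key)   # None on a miss, like real clients
    def set(self, key, value):
        self._store[key] = value

def lookup(cache, key, fetch_from_db):
    value = cache.get(key)
    if value is None:
        # The cache dropped (or never had) this key; the db is authoritative.
        value = fetch_from_db(key)
        cache.set(key, value)
    return value
```

Because every read path tolerates a miss, memcached evicting data under memory pressure degrades performance but never correctness.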

Another is to build local shared caches with BerkeleyDB, MySQL on the
local machine, or Cache::FastMmap.  All of these will be faster than a
remote memcached.  You can update them with a cron job on each server
and all children will see the results immediately.  The same caveats
about surviving missing data apply for Cache::FastMmap -- it's not a
database either.
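A minimal sketch of the cron-driven variant, assuming a file-backed shared store (Python for illustration; in the Perl world this role is played by Cache::FastMmap or a local BerkeleyDB file, and the path is hypothetical):

```python
import json, os, tempfile

CACHE_PATH = "/tmp/app_cache.json"  # hypothetical path shared by all children

def refresh_cache(rows, path=CACHE_PATH):
    """Run from cron: dump fresh db rows, replacing the file atomically so
    readers never observe a half-written cache."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path))
    with os.fdopen(fd, "w") as f:
        json.dump(rows, f)
    os.replace(tmp, path)  # atomic rename: children see old or new, never partial

def read_cache(key, path=CACHE_PATH):
    """Called by children; may legitimately find nothing -- fall back to the db."""
    try:
        with open(path) as f:
            return json.load(f).get(key)
    except FileNotFoundError:
        return None
```

The atomic replace is what makes "all children see the results immediately" safe: there is never a window where a child reads a truncated file.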

In both cases, you are going to sacrifice performance.  What you'll
get for your trouble is memory -- no more duplicating MBs of data in
every process.

For now, what seems like the 'holy grail' (*) is to cache a last_modified value for
each type (available to the cluster, say through memcached) that indicates
which parts of the cache (which keys of each hash) the children need to
update or delete. That way a child rarely, if ever, needs to query for more
than just those keys, and can directly modify its own hashes to keep current.

That actually sounds pretty easy -- put a timestamp on your rows and
only fetch the data that changed since last time you asked.
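In sketch form, assuming an updated_at column on the table (Python with an in-memory SQLite database for illustration; the table and column names are hypothetical):

```python
import sqlite3, time

def refresh_changed(conn, local, since):
    """Merge rows modified after `since` into the local dict; return the
    timestamp to pass as `since` on the next refresh."""
    now = time.time()
    rows = conn.execute(
        "SELECT key, value FROM items WHERE updated_at > ?", (since,))
    for key, value in rows:
        local[key] = value  # patch only the changed keys in place
    return now
```

Each child keeps its own `since` cursor, so a periodic refresh touches only the rows that changed rather than re-pulling megabytes of unchanged data.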

- Perrin
