Here's the IRC chat from today.

Most useful bits:
django-orm-cache.googlecode.com now exists.

todo:
[10:29pm] jdunck: Investigate whether ferringb (of Curse) supplied signal performance patch
[10:29pm] jdunck: Add multiget to all cache backends
[10:29pm] jdunck: Add max_key_length, max_value_length to all cache backends
[10:29pm] jdunck: Add memcache's replace semantics for replace-only-if-exists semantics
[10:29pm] jdunck: Support splitting qs values over max_value_length (in other words, do multiple sets and gets for a single list of objects if needed)
[10:29pm] jdunck: Bench sha vs. (python) hsieh and jenkins
[10:29pm] jdunck: Test w/o CachedModel __metaclass__ since that's a bit silly.
[10:29pm] jdunck: Invalidate whole list if any key in list is missing - ask dcramer
[10:29pm] jdunck: All related field descriptors should check cache first
[10:29pm] jdunck: Port to qs-refactor

Full transcript:
[7:27pm] jdunck: so, orm-based caching
[7:27pm] jdunck: seems like row-level-caching goog soc never went anywhere?
[7:27pm] jdunck: (you have time to talk now?)
[7:33pm] zeeg: ya i do
[7:33pm] zeeg: what you mean goog soc
[7:33pm] jdunck: http://django-object-level-caching.googlecode.com/
[7:34pm] zeeg: oh
[7:34pm] zeeg: i dont like that
[7:34pm] zeeg: at all
[7:34pm] zeeg: but ya
[7:34pm] zeeg: you read over all my stuff?
[7:34pm] jdunck: well, i saw it's monkey-patching the standard QS
[7:34pm] jdunck: i did.
[7:35pm] zeeg: ya really what I want at the core, is a magical CachedModel
[7:35pm] zeeg: that can handle all invalidation that (for most uses) we need
[7:35pm] zeeg: which is just delete invalidation
[7:35pm] zeeg: then using signals for model level dependencies (via registration)
[7:35pm] zeeg: and reverse key mappings (except a Model + pks mapping) for row-level
[7:35pm] jdunck: yeah-- over the sprint, i talked to jacob about some new signal ideas-- and he said he doesn't want to add any signals without first improving signal performance.
[7:36pm] jdunck: i *like* signals, so don't mind improving their performance
[7:36pm] zeeg: ya i think trunk's signals still suck
[7:36pm] zeeg: ferringb patched ours (he's one of the devs at Curse)
[7:36pm] zeeg: im not sure if his patch made trunk tho
[7:36pm] zeeg: but ya, signals for dependencies is the last of my concerns
[7:36pm] zeeg: ill rely on expiration based caching mostly
[7:36pm] zeeg: but being able to handle invalidation at the row level is.. beautiful
[7:37pm] zeeg: its obviously a bit more of a performance hit handling caching like this, but my tests showed it wasn't big enough to matter
[7:37pm] zeeg: i was even going to add in the pre-expiration routines
[7:37pm] zeeg: (so if something expires in <predefined> minutes, it gets automatically locked and recached by the first person to see it)
[7:38pm] jdunck: not sure what you mean by "locked"
[7:38pm] zeeg: basically, when you set a key, you either set another key, or in that key you're setting you tokenize it
[7:38pm] zeeg: and that other key, or the first token
[7:39pm] zeeg: contains the expiration time, or expiration time - minutes
[7:39pm] zeeg: and when you fetch that key
[7:39pm] zeeg: if that expiration time has been reached (the pre-expiration), you set a lock value, which says, if anyone else is looking at this and checking, ignore it
[7:39pm] zeeg: and then you recreate that cache
[7:39pm] jdunck: ah. you assume no purging due to MRU memory limits?
[7:39pm] zeeg: well ya, only so much you can plan for
[7:39pm] zeeg: but w/ that, it potentially stops heavily accessed keys
[7:40pm] zeeg: from being regenerated 100s of times
[7:40pm] jdunck: fwiw, here's a wrapper i made to deal with the same problem: http://code.djangoproject.com/ticket/6199
[7:40pm] zeeg: if they take too long to generate
[7:40pm] zeeg: ah ya
[7:40pm] zeeg: thats your code?
[7:40pm] jdunck: yeah
[7:40pm] zeeg: I think I saw that linked on memcached
[7:40pm] zeeg: they talked about the usage at CNET and I thought it'd be a great addition
[7:40pm] jdunck: hmm. i posted on that list a while back, but it wasn't a ticket at the time.
[7:41pm] zeeg: ya i just remember seeing the code
[7:41pm] jdunck: well, anyway, do you not like that approach? just wrapping stampedes for the whole backend?
[7:41pm] zeeg: and im like, cool, it must be useful if others are doing it
[7:41pm] zeeg: well in the backend I think its the best approach actually
[7:41pm] jdunck: i can see some ppl being annoyed that it has some book-keeping overhead and doesn't store exactly what you say to store.
[7:41pm] zeeg: the way CNET did it, was they used 3 keys
[7:42pm] zeeg: actual data, expiration key, and locking key
[7:42pm] zeeg: which i can see benefits of doing both in separate keys, and in a combined key
[7:43pm] jdunck: do you use gearman or some other background jobber?
[7:43pm] zeeg: nope
[7:43pm] zeeg: not familiar w/ them
[7:44pm] jdunck: i mean, my understanding is that [EMAIL PROTECTED] went a totally different direction-- have a daemon that feeds in updated keys, so that the web app never misses keys
[7:44pm] jdunck: (obviously doesn't work for ad hoc stuff)
[7:44pm] zeeg: ah ya we do that for a few things
[7:44pm] zeeg: only things that are slow to cache tho
[7:44pm] jdunck: do you have a > 1MB memcached compilation?
[7:44pm] jdunck: i was surprised to find that hard limit. QS results can easily reach that.
[7:45pm] zeeg: one sec brb
[7:45pm] zeeg: like 1mb in a key?
[7:45pm] jdunck: yeah
[7:45pm] jdunck: crazy-talk, i know.
[7:45pm] jdunck: in your scheme, can you imagine a list of object keys getting to 1MB?
[7:46pm] jdunck: 100 bytes per key, a list of 10000 object keys would result in ~1mb; missed key set in standard memcache
[7:48pm] zeeg: hrm
[7:48pm] zeeg: so you mean a cache that would store 10k objects in it?
[7:50pm] jdunck: let me back up.
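The three-key scheme zeeg attributes to CNET above (data key, expiration key, locking key) might look roughly like the following. This is a minimal sketch, not the code from ticket 6199: a plain dict stands in for memcached, and `STALE_AHEAD`, the key suffixes, and both function names are invented for illustration.

```python
import time

cache = {}  # stand-in for a memcached client: key -> value

STALE_AHEAD = 60  # start refreshing this many seconds before the real expiry

def set_with_stampede_guard(key, value, timeout):
    # Data key, plus a "soft" expiry key that fires before memcached's own
    # expiry would; clear any leftover lock from a previous rebuild.
    cache[key] = value
    cache[key + ':expires'] = time.time() + timeout - STALE_AHEAD
    cache.pop(key + ':lock', None)

def get_with_stampede_guard(key, regenerate, timeout=300):
    value = cache.get(key)
    soft_expiry = cache.get(key + ':expires', 0)
    if value is not None and time.time() < soft_expiry:
        return value  # still fresh
    # Stale or missing: the first caller takes the lock and rebuilds;
    # concurrent callers keep serving the old value until the rebuild lands.
    if value is not None and cache.get(key + ':lock'):
        return value
    cache[key + ':lock'] = True
    fresh = regenerate()
    set_with_stampede_guard(key, fresh, timeout)
    return fresh
```

The point of the lock is exactly what zeeg describes: a heavily accessed key that is slow to regenerate gets rebuilt once, by whoever hits the soft expiry first, instead of hundreds of times.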
a standard memcache will only store a key value of 1mb or less
[7:50pm] jdunck: you can compile it to store more per key value
[7:51pm] jdunck: we (pegnews.com) are currently throwing queryset results in cache
[7:51pm] jdunck: sometimes that results in a miss because the qs is too big.
[7:51pm] jdunck: we're silly for throwing in huge qs anyway, but quick-n-dirty mostly works
[7:52pm] jdunck: anyway, if i understand correctly, your cached qs would store hash(qs kwargs) as the key, and [ct_id:pk_val1, ct_id:pk_val2, ...] as the value
[7:52pm] jdunck: each individual object has ct_id:pk_val as the key, and the model instance as the value
[7:52pm] jdunck: right?
[7:53pm] jdunck: i was just pointing out that a result list long enough would still hit the 1mb limit, resulting in a miss on the qs key lookup.
[7:55pm] zeeg: ya
[7:55pm] zeeg: you'd still have the same limitation
[7:55pm] zeeg: my plan was to store
[7:55pm] zeeg: hrm
[7:55pm] zeeg: what was my plan
[7:55pm] jdunck: hah
[7:55pm] zeeg: i think it was up in the air
[7:55pm] zeeg: but it'd be like
[7:55pm] zeeg: ModelClass,(pk, pk, pk, pk),(related, fields, to, select)
[7:56pm] zeeg: feel free to poke holes
[7:56pm] zeeg: the one issue i see
[7:56pm] zeeg: im not sure how big ModelClass is
[7:56pm] zeeg: when serialized
[7:59pm] zeeg: but w/ this cool system
[7:59pm] zeeg: if you *needed* to
[7:59pm] zeeg: you could say "oh shit im trying to insert too much"
[8:00pm] zeeg: and be like ModelClass, (pks*,), (fields*), number_of_keys
[8:00pm] zeeg: and split it into multiple keys
[8:00pm] zeeg: it would be nearly just as fast
[8:00pm] zeeg: basing off of my multi-get bench results
[8:00pm] zeeg: thats what i like about taking this approach
[8:00pm] zeeg: is the developer doesnt have to worry about any of that
[8:04pm] zeeg: im actually hoping to get a rough version of this done over the holidays while im on vaca
[8:05pm] jdunck: the (related,fields,to,select) bit above is FK/M2M rels to follow?
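The layout restated above (hash of the qs kwargs as the parent key, a list of ct_id:pk strings as the value, split into numbered chunks plus a number_of_keys when the value would exceed the 1MB limit) could be sketched like this. A dict stands in for memcached, and `MAX_VALUE_LEN` and all function names are hypothetical, not the project's actual API.

```python
cache = {}                    # stand-in for memcached
MAX_VALUE_LEN = 1024 * 1024   # memcached's default per-value limit

def qs_key(model_table, **kwargs):
    # Sort the filter kwargs so equivalent queries hash identically.
    # Note the builtin hash() is only stable within one process, which
    # is part of what the sha discussion later in the chat is about.
    return '%s:%s' % (model_table, hash(tuple(sorted(kwargs.items()))))

def set_result_list(key, object_keys):
    # Join the "ct_id:pk" strings, split the blob into chunks that each
    # fit under the value limit, and record the chunk count under the
    # parent key (the "number_of_keys" idea from the chat).
    blob = ','.join(object_keys)
    chunks = [blob[i:i + MAX_VALUE_LEN]
              for i in range(0, len(blob), MAX_VALUE_LEN)] or ['']
    cache[key] = len(chunks)
    for n, chunk in enumerate(chunks):
        cache['%s:%d' % (key, n)] = chunk

def get_result_list(key):
    count = cache.get(key)
    if count is None:
        return None
    blob = ''.join(cache['%s:%d' % (key, n)] for n in range(count))
    return blob.split(',') if blob else []
```

Because chunks are rejoined before splitting on commas, a chunk boundary falling mid-key is harmless; the developer-facing API never sees the splitting, which is the property zeeg values.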
[8:05pm] zeeg: select_related more or less
[8:05pm] zeeg: so it knows what to look up in the batch keys when it grabs it
[8:05pm] jdunck: yeah.. i wonder what select_related does for cycles...
[8:05pm] zeeg: so it does select list -> select list of pks (batch) -> select huge batch of related_fields
[8:05pm] jdunck: yeah, i follow
[8:05pm] zeeg: although that potentially may have to be split up too
[8:06pm] zeeg: is there a limit on how much data goes back and forth between memcached
[8:06pm] jdunck: yeah, that's a simple abstraction, no biggie
[8:06pm] zeeg: or is that the 1mb you were referring to (i was assuming storage)
[8:06pm] jdunck: it's 1mb per key value by default in memcache.
[8:06pm] zeeg: k
[8:06pm] jdunck: other backends are different, i'm sure
[8:06pm] zeeg: ya dont care about those tho
[8:06pm] zeeg: if anyone uses anything else they're not looking for the kind of performance this is aimed at
[8:07pm] zeeg: but in theory, it'd support them
[8:07pm] zeeg: (i dont think they allow multi-gets tho, so it probably does them one at a time)
[8:07pm] jdunck: i don't really care about them either, but if this is to go in core, we probably should make max_value_size and supports_multiget vals on the cache backend
[8:08pm] zeeg: dont cache backends all have multi get by default?
[8:08pm] zeeg: i saw it in the memcached code so i assumed it was across the board
[8:08pm] zeeg: (i want to personally add incr/decr into the cache backend)
[8:08pm] zeeg: thats another thing id like to potentially support with this, is namespaces
[8:08pm] zeeg: but thats another pretty big addition
[8:08pm] zeeg: and can come later
[8:08pm] jdunck: nope, not in file, for example
[8:08pm] jdunck: easy to add, tho, that's a good point
[8:09pm] zeeg: but being that cache keys are db_table:hash, should be fairly easy
[8:09pm] jdunck: honestly, i don't get what incr/decr does. are you hand-rolling ref-counting on something?
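The multiget to-do above could be met with a base-class fallback: backends with a native multiget (memcached) override `get_many`, everything else (file, db) inherits a loop over `get()`. A sketch, not Django's actual backend code; the capability attributes mirror the max_value_size / supports_multiget idea from the chat.

```python
class BaseCache(object):
    # Capability values the ORM cache could consult, as discussed.
    max_value_length = 1024 * 1024
    supports_multiget = False

    def get(self, key, default=None):
        raise NotImplementedError

    def get_many(self, keys):
        # Fallback multiget for backends without a native one: a get()
        # per key, skipping misses, returning a dict the way
        # python-memcached's get_multi does.
        result = {}
        for key in keys:
            value = self.get(key)
            if value is not None:
                result[key] = value
        return result

class DictCache(BaseCache):
    # Minimal concrete backend for demonstration.
    def __init__(self):
        self._store = {}

    def set(self, key, value):
        self._store[key] = value

    def get(self, key, default=None):
        return self._store.get(key, default)
```

As zeeg notes, the fallback does the gets one at a time, so it is correct but loses the round-trip savings that make multiget attractive on memcached.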
[8:09pm] jdunck: i mean, i understand what the primitive does, i'm just not smart enough to see the point
[8:10pm] zeeg: ya if you used namespaces it could help
[8:10pm] zeeg: if you were threaded
[8:10pm] zeeg: and you did cache.get then cache.set
[8:10pm] zeeg: it could be invalid
[8:10pm] zeeg: vs cache.incr
[8:10pm] zeeg: or w/e
[8:11pm] jdunck: are you making your code avl in hg somewhere?
[8:12pm] jdunck: i mean, how do i contribute
[8:13pm] zeeg: hrm i can see what it'd take to get a branch setup on djangoproj
[8:13pm] zeeg: i dont think i can set it up on curse
[8:13pm] zeeg: as i think ours requires auth
[8:13pm] zeeg: (not too familiar w/ setup)
[8:13pm] jdunck: yeah
[8:14pm] zeeg: the code ive got so far is just a copy paste of modelbase/model editing the parts that need changing, and a copy paste of our current "CacheManager" code which does no invalidation whatsoever
[8:14pm] jdunck: i'm tracking 2 branches already. my head's gonna explode. i was just hoping to take changesets from you and quilt them or something.
[8:14pm] zeeg: ah ya
[8:14pm] zeeg: id rather it not even be in a branch honestly
[8:14pm] zeeg: id rather just external it in my cur stuff
[8:14pm] zeeg: much easier
[8:14pm] zeeg: actually i can set it up on google code i think
[8:15pm] zeeg: i think the biggest factor of the current code, is the cachemanager -- need to guarantee no conflicts w/ the cache key
[8:16pm] zeeg: need a name
[8:16pm] jdunck: when you say cache manager, do you mean like a regular manager that returns a cacheqs from get_query_set, or is this something else?
[8:16pm] zeeg: ya it caches all queries
[8:16pm] zeeg: http://www.pastethat.com/?NYxG2
[8:16pm] zeeg: its probably not the best approach
[8:16pm] zeeg: but its what i had working
[8:17pm] zeeg: clean/reset are something im unsure of too
[8:17pm] zeeg: clean can probably be removed
[8:17pm] zeeg: (merged w/ reset more or less)
[8:17pm] zeeg: clean was to delete it from a cache, which...
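zeeg's answer on incr/decr is about atomicity: under concurrency, `cache.get()` followed by `cache.set()` has a window in which another thread can write, and one of the two updates is silently lost; a server-side `incr` has no such window. A toy illustration, with a lock playing the role of memcached's atomic incr (the class and key names are invented):

```python
import threading

class CounterCache(object):
    """Toy backend showing why an atomic incr beats get() then set()."""
    def __init__(self):
        self._store = {}
        self._lock = threading.Lock()

    def get(self, key):
        return self._store.get(key)

    def set(self, key, value):
        self._store[key] = value

    def incr(self, key, delta=1):
        # The whole read-modify-write happens under one lock, modelling
        # memcached's server-side atomic incr.  With separate get()/set()
        # calls, another thread could write between the two calls and its
        # update would be lost.
        with self._lock:
            self._store[key] = self._store.get(key, 0) + delta
            return self._store[key]

cache = CounterCache()
cache.set('ns:version', 1)
# Bumping a namespace version with incr invalidates every key derived
# from it, with no race between readers and the bump.
new_version = cache.incr('ns:version')
```

This is the namespace use zeeg hints at: keys embed the namespace version, so incrementing the version implicitly invalidates the whole namespace.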
well i think (now) its a waste to ever delete from the cache
[8:17pm] zeeg: only to set over something, and let memcached delete
[8:17pm] zeeg: but maybe for non-memcached users its useful
[8:18pm] jdunck: django-orm-cache ? django-cachemanager ?
[8:19pm] jdunck: as for ensuring no collision, simplest to just sort all args into consistent order, then sha it, no?
[8:20pm] jdunck: or are you trying for readable keys?
[8:20pm] zeeg: im using hash
[8:20pm] zeeg: i didnt want to sha
[8:20pm] zeeg: or md5
[8:20pm] zeeg: as i figured that'd be slow
[8:20pm] jdunck: mebbe sha is too expensive, yeah
[8:20pm] zeeg: but it does sort etc first
[8:20pm] jdunck: there are lots of hash algs out there
[8:20pm] zeeg: http://code.google.com/p/django-orm-cache/
[8:20pm] zeeg: gonna put the cache.py i have up there real quick
[8:22pm] zeeg: k its committed there
[8:22pm] jdunck: but collision is v. important to avoid.
[8:22pm] zeeg: whats your google account name?
[8:22pm] jdunck: [EMAIL PROTECTED]
[8:22pm] zeeg: alright added you on the project
[8:22pm] jdunck: python's sha is in c, i expect
[8:22pm] jdunck: danke
[8:22pm] zeeg: ya we can bench that
[8:22pm] zeeg: i have to run though, ill be home in a few hours
[8:23pm] jdunck: k
[8:23pm] jdunck: interesting hashes
[8:23pm] jdunck: http://www.burtleburtle.net/bob/hash/doobs.html
[8:23pm] jdunck: http://www.azillionmonkeys.com/qed/hash.html
[8:23pm] jdunck: you ok w/ me posting irc archives somewhere? marty wanted to be in on discussion
[9:54pm] zeeg: back
[10:05pm] jdunck: hello
[10:06pm] jdunck: so, i was just going to post the irc log to the list thread, k?
[10:06pm] zeeg: get a chance to look over stuff
[10:06pm] zeeg: ya go ahead
[10:09pm] jdunck: well, i'd seen cachedmodel before.. but its __new__ is the same as dj trunk, right?
[10:10pm] jdunck: you're just using it to automagically supply objects=CacheManager ?
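The collision-avoidance approach jdunck suggests above, sort all the args into a consistent order and then sha the result, might look like this (`make_key` is a hypothetical name; this uses hashlib's C-implemented sha1, which is the "python's sha is in c" point):

```python
import hashlib

def make_key(table, kwargs):
    # Sort the filter kwargs into a canonical order, then digest.  Unlike
    # the builtin hash(), sha1 is stable across processes and practically
    # collision-free, at the cost of computing a digest (cheap, since the
    # implementation is in C).
    canonical = repr(sorted(kwargs.items()))
    return '%s:%s' % (table, hashlib.sha1(canonical.encode()).hexdigest())
```

Whether that cost matters versus hash(), or versus the Hsieh and Jenkins functions jdunck links, is exactly the "bench sha" item on the to-do list.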
[10:10pm] zeeg: right
[10:10pm] zeeg: i think new is pretty much identical
[10:10pm] jdunck: eh, i'd just tell people to override objects
[10:10pm] jdunck: that's what we're doing in gis
[10:11pm] zeeg: well then they have to override save/delete also
[10:11pm] zeeg: or i have to use signals [yuck]
[10:11pm] jdunck: hmm
[10:11pm] zeeg: id rather just say "this model is cached" and it work
[10:11pm] zeeg: I tried doing it as a mix-in but you cant override certain things
[10:11pm] jdunck: i guess it didn't work to just make cachedmodel derive from models.Model, but not set __metaclass__ ?
[10:12pm] zeeg: iirc theres issues
[10:12pm] zeeg: being that you cant properly subclass models
[10:19pm] jdunck: looks like _get_sorted_clause_key is sorting the joined-up where string
[10:19pm] jdunck: admittedly unlikely to collide, but probably not what was intended
[10:20pm] jdunck: that is, the 2nd val in the return from queryset._get_sql_clause is the string form of the join/where clause
[10:24pm] jdunck: anyway, i've gotta run soon.
[10:25pm] jdunck: what do you think of invalidating the whole list
[10:25pm] jdunck: if a key is missing from the multiget?
[10:29pm] jdunck: here's the to-do list from chat here and code comments:
[10:29pm] jdunck: Investigate whether ferringb (of Curse) supplied signal performance patch
[10:29pm] jdunck: Add multiget to all cache backends
[10:29pm] jdunck: Add max_key_length, max_value_length to all cache backends
[10:29pm] jdunck: Add memcache's replace semantics for replace-only-if-exists semantics
[10:29pm] jdunck: Support splitting qs values over max_value_length (in other words, do multiple sets and gets for a single list of objects if needed)
[10:29pm] jdunck: Bench sha vs. (python) hsieh and jenkins
[10:29pm] jdunck: Test w/o CachedModel __metaclass__ since that's a bit silly.
[10:29pm] jdunck: Invalidate whole list if any key in list is missing - ask dcramer
[10:29pm] jdunck: All related field descriptors should check cache first
[10:29pm] jdunck: Port to qs-refactor
[10:29pm] zeeg: ya
[10:29pm] zeeg: that was the plan
[10:30pm] zeeg: if key is missing invalidate parent
[10:30pm] zeeg: but ya that sounds good
[10:31pm] jdunck: k
[10:31pm] jdunck: offline-- will post this to list

(Posted to the Google Groups "Django developers" group: http://groups.google.com/group/django-developers?hl=en)
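The "invalidate whole list if any key in list is missing" plan zeeg confirms above could work like this: after the multiget for the per-object keys, any miss (eviction, expiry) means the cached list can no longer be trusted, so the parent key is dropped and everything rebuilt from the database. A sketch with a dict standing in for the backend; all names are invented.

```python
cache = {}  # stand-in backend: key -> value

def get_objects(list_key, fetch_from_db):
    # The parent key holds the ordered list of per-object keys.
    object_keys = cache.get(list_key)
    if object_keys is not None:
        # Multiget for the individual objects.
        found = dict((k, cache[k]) for k in object_keys if k in cache)
        if len(found) == len(object_keys):
            return [found[k] for k in object_keys]
        # Any miss means the list can't be trusted: invalidate the
        # parent key and rebuild below.
        del cache[list_key]
    rows = fetch_from_db()  # -> list of (object_key, object) pairs
    for k, obj in rows:
        cache[k] = obj
    cache[list_key] = [k for k, _ in rows]
    return [obj for _, obj in rows]
```

Rebuilding the whole list on a single miss is deliberately conservative; the alternative, fetching only the missing rows, would need per-row queries and loses the single-batch-query property.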
