There are only two hard things in Computer Science: cache invalidation and 
naming things.
-- Phil Karlton

Hi, I'd like to share with you an algorithm for cache invalidation that 
I've come up with and successfully implemented in a real-world 
application. I am a software architect at nk.pl, a Polish social network 
successfully competing with FB, and the founder and backend architect of 
Vanisoft, a startup in which I have had more freedom to experiment with a 
different architecture.

Perhaps my idea is not original, but I could not find a description of 
a similar concept anywhere, so even if this is reinventing the wheel, maybe 
it will help propagate the knowledge and save somebody else's time. When I 
was designing the backend for Vanisoft's project ZnaszMnie (currently over 
400k players), all I could find on the internet about ways to cache database 
query results was either wrong (such as no cache invalidation at all) or 
ineffective (trying to keep a list of all cached results and 
invalidating them one by one).

I assume you have more than one front-end machine, some databases, and a 
cache reachable from the front ends. Perhaps you also have a local cache on 
each front-end machine, which is really fast but inaccessible from the other 
machines (and thus subject to cache invalidation problems). If so, then my 
approach can correctly provide a cached answer to a query using one 
round-trip to the global cache and one to the local cache (the cost of the 
latter is usually negligible), and never serve stale data (and that last 
part is the most important to me).

I assume that even if your database is not SQL, you can think about your 
queries as expressible in SQL, using SELECT, INSERT, UPDATE, or DELETE. This 
is needed only to understand the presentation of the algorithm, which can 
easily be adapted to many other cases.

To keep things simple, let us consider a binary relation (a table with two 
columns, or a set of points in 2-dimensional space, if you prefer). The 
algorithm works for any number of dimensions, but it is easier to grasp in 
the 2-D case.

So let's think about a table `songs` with columns "song_id" and 
"author_id", where obviously some songs may have several authors and 
vice versa. Again, for simplicity, assume that these are integers; the 
algorithm works for any (serializable) data type.

For each query, I'd like to introduce the concept of the "subspace" on which 
it operates. If you choose the interpretation of a binary relation as a set 
of points in 2-dimensional space, then a subspace is what you get by fixing 
some of the dimensions.
Some examples -- a list of queries with their subspaces:
INSERT INTO songs (song_id, author_id) VALUES (1,2)  ::  [song_id=1, 
author_id=2]
SELECT COUNT(*) FROM songs WHERE author_id = 7  ::  [author_id=7]
DELETE FROM songs WHERE song_id = 13  ::  [song_id=13]
DELETE FROM songs  ::  []

Updates are a bit more complicated, and I prefer to think about them as a 
DELETE followed by an INSERT. For example:
UPDATE songs SET author_id=13 WHERE author_id=7 AND song_id=3 
can be thought of as:
DELETE FROM songs WHERE author_id=7 AND song_id=3  ::  [song_id=3, 
author_id=7]
INSERT INTO songs (song_id, author_id) VALUES (3,13)  ::  [song_id=3, 
author_id=13]
I know this is not equivalent and may cause some race conditions, but I 
believe it does not really matter in what follows.

As you can see in our example, a subspace can be 0-, 1- or 2-dimensional, 
depending on the scope on which the query operates. If we are unsure 
about the actual scope, we need to upper-bound it for my algorithm to work. 
For example:
SELECT COUNT(*) FROM songs WHERE song_id < 13  ::  []
Here we cannot say much about which rows are important for this query, 
so we upper-bound it with the whole space. Please note that a subspace has 
nothing to do with the columns returned by the SELECT, but rather with the 
rows scanned during it.

In general, the subspace is determined by all equality constraints found in 
the query. This is such a trivial correspondence that you can (and should) 
let your backend library deduce it automatically from the query string. Or, 
if you use some object-oriented query builder, you get it for free.
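
To make the correspondence concrete, here is a minimal sketch (in Python) of 
deducing a subspace from the equality constraints of a query. The column 
list and the regex-based parsing are assumptions made purely for 
illustration; in a real application you would hook into your query builder 
rather than parse SQL strings.

```python
import re

# Columns we track for the `songs` table (an assumption for this example).
COLUMNS = ("song_id", "author_id")

def subspace(query):
    """Deduce the subspace of a query from its `column = value` equality
    constraints.  Returns a dict such as {'author_id': 7}; an empty dict
    means the whole space.  (INSERTs would map their column/value lists
    directly to a fully fixed subspace -- not handled by this toy parser.)"""
    found = {}
    for col in COLUMNS:
        match = re.search(col + r"\s*=\s*(\d+)", query)
        if match:
            found[col] = int(match.group(1))
    return found

print(subspace("SELECT COUNT(*) FROM songs WHERE author_id = 7"))   # {'author_id': 7}
print(subspace("DELETE FROM songs WHERE song_id = 13"))             # {'song_id': 13}
print(subspace("DELETE FROM songs"))                                # {}
```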

For a 2-dimensional space, there are 4 kinds of subspaces:
- the whole space, given by no constraints at all: []
- a point, given by two constraints, for example [song_id=3, author_id=7]
- a vertical or horizontal line, given by one constraint, for example 
[song_id=3] or [author_id=7]

If we have two queries A and B, such that A is a write (DELETE or INSERT) 
and B is a read (SELECT), then a cached result for B should be invalidated 
after query A if the subspace of A intersects the subspace of B. (Actually 
this is a sufficient condition, not a necessary one.)
For example, the query
DELETE FROM songs WHERE author_id=7 AND song_id=3  ::  [song_id=3, 
author_id=7]
affects the results of
SELECT COUNT(*) FROM songs WHERE author_id = 7  ::  [author_id=7]
and you can tell this just by comparing [author_id=7] with [song_id=3, 
author_id=7], without looking at the queries. Of course this is a safe 
approximation: sometimes cache invalidation is not necessary, but 
performing it ensures the correctness of the algorithm.

Now notice that there may be infinitely many queries which need to be 
invalidated, as there are infinitely many subspaces that intersect a given 
one. For example, a query with subspace [author_id=7] must invalidate all 
queries with subspace [song_id=1], as well as [song_id=2], and so on. 
Moreover, there are infinitely many possible queries with the same subspace. 
Frameworks that try to keep track of all intersecting queries are doomed 
either to store everything in large, permanent storage (which kind of 
defeats the purpose of an in-memory cache to me), or to eventually forget 
about a query which needs to be invalidated (which is an error to me).

The solution is much simpler, though, and does not involve operating on 
infinitely many cache keys... only exponentially many. But don't worry: it 
is exponential in the number of dimensions, not in the size of the dataset. 
For example, for a binary relation we need to invalidate 2^2 = 4 keys. If 
you have a table with many columns, don't be alarmed either -- you just 
need to choose a few columns on which you'd like to focus, and crop the 
subspace to those few columns. You will invalidate things more often than 
necessary, but you will not be hit by the exponential blow-up. I've never 
experienced a case with more than 3 relevant columns, and 2^3 = 8 is not a 
big number.

We need some shorter notation for subspaces. Let us write them as 
2-dimensional vectors (similar to rows in the table). Instead of 
[song_id=3, author_id=7], let us write (3,7), and for [song_id=3] we will 
use (3,*). The whole space is (*,*), and [author_id=4] is just (*,4).

Now the tricky part. Let us introduce a new wildcard symbol "?", which will 
mean something along the lines of "all subspaces which have a number here". 
For example, (?,3) is shorthand for the infinite set of subspaces 
{(0,3),(1,3),(2,3),...}. There is no particular query that has subspace 
(?,3). This artificial notation corresponds to all queries which have the 
second dimension fixed.

I assume that your global cache (such as memcache) supports an atomic 
increment operation, which is very convenient for the task of maintaining a 
"revision number", which we will associate with each subspace (artificial 
or not). When I say "with each", I don't mean buying an infinitely large 
computer -- we will just allocate memory for the subspaces actually used in 
production, and those unused (or rarely used, if you have an LRU garbage 
collector) will simply take no memory at all.
A short note about evictions: if a get or increment results in a miss, 
simply set the revision to the current timestamp.
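
A minimal sketch of those revision counters, assuming a memcached-style 
client object with `get`, `add` and `incr` methods (the exact method names 
and their behaviour on a miss vary between client libraries, so treat this 
as an assumption to adapt):

```python
import time

def get_revision(cache, subspace_key):
    """Fetch the revision of a subspace; on a miss (e.g. after an eviction),
    re-seed it with the current timestamp so it can never fall back to a
    value that was already used before."""
    rev = cache.get(subspace_key)
    if rev is None:
        cache.add(subspace_key, int(time.time()))  # add() is a no-op if someone beat us to it
        rev = cache.get(subspace_key)
    return rev

def bump_revision(cache, subspace_key):
    """Atomically increment a revision; treat a missing key like a miss."""
    if cache.incr(subspace_key) is None:           # assumed: incr() returns None on a miss
        cache.add(subspace_key, int(time.time()))
```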

I also assume that you know the idea of generational caching, that is, 
using a key name with the revision appended to store cached data (such as 
"get_songs_count.123"). This way, a single operation of incrementing a 
revision can invalidate all data stored under key names which depended on 
it. This seems like a waste of memory, but an LRU garbage collector can take 
care of the leftovers. The nice thing about it is that while a single 
integer (the revision number) must be stored in the global cache, the 
potentially large query result can be stored in the local cache, from where 
it can be quickly retrieved without any bottlenecks of the global cache. 
This is actually the only way (except for using silly small TTLs) of 
invalidating data stored in a local cache that I know of and use.
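
In code, generational caching is just a matter of folding the revision into 
the key. A sketch, with the cache clients and the query runner passed in as 
plain parameters (all names here are illustrative, not from any particular 
library), reusing `get_revision` from the sketch above:

```python
def cached_songs_count(global_cache, local_cache, run_query):
    """Store the (possibly large) result in the local cache under a key that
    embeds the revision; bumping the revision in the global cache makes every
    old key unreachable, and the LRU eventually evicts the leftovers."""
    revision = get_revision(global_cache, "rev:songs")
    key = "get_songs_count." + str(revision)        # e.g. "get_songs_count.123"
    result = local_cache.get(key)
    if result is None:
        result = run_query("SELECT COUNT(*) FROM songs")
        local_cache.set(key, result)
    return result
```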

Now the algorithm. Or actually two algorithms, one for read-only queries, 
and one for writes.

When you perform a read query A, first determine its subspace. I assume you 
have d columns, so the subspace can be expressed as a d-ary vector of 
numbers and stars, as explained above. For example, the query "SELECT 
song_id FROM songs WHERE author_id = 3" corresponds to the vector (*,3).
Then compute all variations of this d-ary vector (at most 2^d of them), 
using a single substitution rule: you may put "?" in place of a number. 
For example, if your original vector was (*,3), then this results in two 
vectors: (*,?) and the original. If, on the other hand, your query had 
subspace (7,3), then you would end up with four variations: 
(7,3), (?,3), (7,?), (?,?).
Then, using a bulk (multi)get, fetch from the global cache the revisions for 
those (at most 2^d) subspaces.
Concatenate the revisions into one big revision (for example 123.3.114.14).
Then fetch from the local cache the cached result of the query, which 
should reside at the key which is a concatenation of the query string (or a 
hash of it, if you prefer) and the big concatenated revision.
If there is a miss in the local cache, try the global cache.
If there is a miss in the global cache, query the database.
Then add the missing keys to the caches as needed.
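
Putting the read path together, here is a sketch. Subspaces are written as 
tuples containing "*" and "?" literally, mirroring the notation above; 
`get_multi` and the other cache methods are assumed to exist on your client 
(adapt the names as needed), and `get_revision` is the helper sketched 
earlier.

```python
from itertools import product

STAR, ANY = "*", "?"    # "*" = unconstrained dimension, "?" = artificial wildcard

def read_variants(subspace):
    """All variants obtained by putting "?" in place of a number (at most 2^d)."""
    choices = [(v,) if v == STAR else (v, ANY) for v in subspace]
    return [tuple(variant) for variant in product(*choices)]

def revision_key(subspace):
    return "rev:" + ",".join(str(v) for v in subspace)

def cached_read(global_cache, local_cache, database, query, subspace):
    """One multi-get to the global cache for the revisions, then a lookup of
    the result under a revision-stamped key: local cache first, then global
    cache, then the database."""
    rev_keys = [revision_key(v) for v in read_variants(subspace)]
    revisions = global_cache.get_multi(rev_keys)       # one round-trip; assumed to return a dict
    for key in rev_keys:
        if key not in revisions:                       # evicted: re-seed with a timestamp
            revisions[key] = get_revision(global_cache, key)
    big_revision = ".".join(str(revisions[key]) for key in rev_keys)   # e.g. "123.3.114.14"
    data_key = query + "#" + big_revision              # or a hash of the query string
    result = local_cache.get(data_key)
    if result is None:
        result = global_cache.get(data_key)
        if result is None:
            result = database.execute(query)
            global_cache.set(data_key, result)
        local_cache.set(data_key, result)
    return result

# Example: SELECT song_id FROM songs WHERE author_id = 3  ->  subspace (*, 3)
# cached_read(gc, lc, db, "SELECT song_id FROM songs WHERE author_id = 3", (STAR, 3))
```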

When you perform a write query B, determine its subspace.
Then compute exactly 2^d possible variations of it, using two 
substitution rules:
a) you may put "?" in place of "*"
b) you may put "*" in place of any number
For example, for a subspace (*,3) you'll end up with four variants: 
(*,3), (?,3), (*,*), (?,*).
Then perform the query B.
Finally, increment the revisions of all these variants.
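
And the corresponding write path, reusing `revision_key` and `bump_revision` 
from the sketches above; if your cache supports bulk increments, the loop 
can collapse into a single round-trip.

```python
def write_variants(subspace):
    """Exactly 2^d variants: "*" may become "?", and a number may become "*"."""
    choices = [(STAR, ANY) if v == STAR else (v, STAR) for v in subspace]
    return [tuple(variant) for variant in product(*choices)]

def invalidating_write(global_cache, database, query, subspace):
    """Perform the write, then bump the revision of every variant so that any
    cached read whose subspace intersects ours ends up under a fresh key."""
    database.execute(query)
    for variant in write_variants(subspace):
        bump_revision(global_cache, revision_key(variant))

# Example: DELETE FROM songs WHERE author_id = 3  ->  subspace (*, 3)
# invalidating_write(gc, db, "DELETE FROM songs WHERE author_id = 3", (STAR, 3))
# bumps the revisions of (*,3), (?,3), (*,*) and (?,*)
```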

As you can see, updates are a bit slower than reads (which in my opinion is 
usual and nothing to be worried about). Some cache implementations allow 
bulk increments in one round-trip, in which case this might actually be 
faster than the read scenario, as you do not contact the local cache at all.

Now, some explanation of why it works.
Imagine a bipartite graph which on the left-hand side has one vertex for 
each possible subspace of a write query, and on the right-hand side has 
vertices corresponding to the subspaces of read queries. Actually both sets 
are equal, but we will focus on the edges. An edge goes from left to right 
if a query on the left side affects the results of a query on the right 
side. As said before, both sets are infinite, but that's not the problem. 
There are infinitely many edges, but that's also not bad. What's bad is that 
there are nodes on the left side with infinite degree, which means we would 
need to invalidate infinitely many queries.

What the above tricky algorithm does is add a third layer to the graph, in 
the middle between the two, such that the transitive closure of the 
resulting graph is still the same (in other words: you can still get, using 
two edges, anywhere you could get using one edge in the original graph), yet 
each node on the left, and each node on the right, has finite (actually 
constant) degree. This middle layer corresponds to the artificial subspaces 
with "?" marks, and serves as a connecting hub for all the mess. Now, when 
a query on the left executes, it needs to inform only its (small number of) 
neighbours about the change, moving the burden of reading this information 
to the right. That is, a query on the right side needs to check whether 
there is a message in its "inbox" in the middle layer. So you can think 
about it as a cooperation where the left query makes one step forward, and 
the right query makes one step back, so that they meet at a central place 
and pass along the important information about the invalidation of the 
cache.

Another point of view is that two subspaces (v[1],...,v[d]) and 
(u[1],...,u[d]) intersect each other if and only if they agree on all 
positions without stars. In other words: there is no index k such that 
v[k]<>u[k] and v[k]<>* and u[k]<>*.
So, given a vector (v[1],...,v[d]), you can predict that any intersecting 
subspace will be described by a vector (u[1],...,u[d]) of some particular 
form. That is, it can be obtained from (v[1],...,v[d]) by using some simple 
substitution rules:
a) you may change any number to a star
b) you may change any star to any number
c) you may not use both of the above rules for the same index k
The problematic rule b) allows infinitely many possibilities.
To overcome this, we introduced the question mark symbol, which represents 
"any number" without stating any particular number. The path of operations 
leading from (v[1],...,v[d]) to (u[1],...,u[d]) is then explored from the 
other end, by trying to substitute numbers in (u[1],...,u[d]) with question 
marks. If we can reach the same term from both ends, then we know that 
there is a path from one end to the other, even though there is an infinite 
maze in the middle.
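
If you want to convince yourself that the two variant rules really meet in 
the middle, here is a small self-contained check: on a toy domain it 
verifies that a read's variants and a write's variants share a key exactly 
when their subspaces intersect.

```python
from itertools import product

STAR, ANY = "*", "?"

def read_variants(r):
    """Read rule: a number may be replaced by "?"."""
    return set(product(*[(x,) if x == STAR else (x, ANY) for x in r]))

def write_variants(w):
    """Write rules: "*" may become "?", a number may become "*"."""
    return set(product(*[(STAR, ANY) if x == STAR else (x, STAR) for x in w]))

def intersects(r, w):
    """Subspaces intersect iff they agree on every position without stars."""
    return all(a == b or a == STAR or b == STAR for a, b in zip(r, w))

# Exhaustive check over a tiny 2-D domain.
domain = [0, 1, STAR]
for r in product(domain, repeat=2):          # all read subspaces
    for w in product(domain, repeat=2):      # all write subspaces
        shared = bool(read_variants(r) & write_variants(w))
        assert shared == intersects(r, w), (r, w)
print("read and write variants meet exactly when the subspaces intersect")
```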

As pointed out before, you can use this technique for a higher number of 
columns, or keep the focus on just a few of them at the expense of premature 
invalidation. You can easily use it in a topology without local caches, 
by storing everything in the global cache. You can adapt the technique to 
columns storing strings, or nulls, or whatever. You can use it for NoSQL 
databases if you need to. But this is not the end of the possible 
improvements.

If you know in advance what kind of write queries you will perform, that is 
-- if you know what subspaces they will have, you can "trim" the graph to 
a smaller one, which will result in faster reads and writes, as you will 
not have to fetch that many (is 4 many?) keys.

For example, if your modifications are just single-row operations (INSERT 
one row, DELETE one row), then the subspaces of your modifying queries never 
have stars in them. You can immediately see from the substitution rules that 
you will never ever increment the revisions of the artificial subspaces 
(those with question marks), as "?" can be made only from "*". If so, then 
it is not necessary to fetch them, or even think about them. You can safely 
ignore them.
Now, read queries only depend on the revisions of vectors which differ from 
the original vector by replacing something with question marks. Since such 
revisions are never incremented in this setting, you only need to fetch the 
revision of the original vector, which obviously has no question marks. This 
allows for a quite fast and simple implementation which fetches only a 
single key from the global cache. Actually, my original framework forbade 
any modifications other than row-by-row just to get these extra-fast reads. 
I believe that in many real-world scenarios you never INSERT more than a 
single row at once. I've seen cases where you had an urge to DELETE many 
rows at once (for example when cleaning up dependent records), and may feel 
bad about the idea of doing it row by row. Think about the amortized cost, 
though: each of the rows you delete had to be added at some point in time, 
and if you already feel comfortable with the idea of adding rows one by one, 
then you have already paid enough credit not to worry about the deletes.
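
Under that restriction the read path collapses to a single revision fetch. 
A sketch of how the earlier `cached_read` simplifies, reusing `get_revision` 
and `revision_key` from above (same assumed cache interface as before):

```python
def cached_read_single_row_writes(global_cache, local_cache, database, query, subspace):
    """When every write fixes all columns, the "?" revisions are never bumped,
    so a read needs only the revision of its own subspace vector."""
    revision = get_revision(global_cache, revision_key(subspace))   # one key instead of 2^d
    data_key = query + "#" + str(revision)
    result = local_cache.get(data_key)
    if result is None:
        result = global_cache.get(data_key)
        if result is None:
            result = database.execute(query)
            global_cache.set(data_key, result)
        local_cache.set(data_key, result)
    return result
```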

That's it, folks. Now, which frameworks implement this idea, in which 
paper can I read about it, and how many years have I overslept?
