This post was just AWESOME! Many thanks for the time spent and so much detail.
Congratulations, truly excellent work. Thanks, Art

On Apr 28, 2012 8:14 AM, "Jakub Łopuszański" <[email protected]> wrote:

> *There are only two hard things in Computer Science: cache invalidation and naming things. -- Phil Karlton*
>
> Hi, I’d like to share with you an algorithm for cache invalidation that I’ve come up with and successfully implemented in a real-world application. I am a software architect at nk.pl, a Polish social network successfully competing with FB, and a founder and backend architect of Vanisoft, a startup in which I had more freedom to experiment with a different architecture.
>
> Perhaps my idea is not original, but I could not find a description of a similar concept anywhere, so even if this is reinventing the wheel, maybe it will help propagate the knowledge and save somebody else’s time. When I was designing the backend for Vanisoft’s project ZnaszMnie (currently over 400k players), all I could find on the internet about ways to cache database query results was either wrong (such as no cache invalidation at all) or ineffective (trying to keep a list of all cached results and invalidating them one by one).
>
> I assume you have more than one front-end machine, some databases, and a cache reachable from all frontends. Perhaps you also have a local cache on each front-end machine, which is really fast but inaccessible from other machines (and thus subject to cache invalidation problems). If so, then my approach can correctly provide a cached answer to a query using one round-trip to the global cache and one to the local cache (the cost of the latter is usually negligible), and never serve stale data (and the last part is the most important to me).
>
> I assume that even if your database is not SQL, you can think of your queries as expressible in SQL, using SELECT, INSERT, UPDATE, or DELETE. This is needed only to understand the presentation of the algorithm, which can easily be adapted to many other cases.
>
> To keep things simple, let us consider a binary relation (a table with two columns, or a set of points in 2-dimensional space, if you prefer). The algorithm works for any number of dimensions, but it is easier to grasp in the 2-D case.
>
> So let’s think about a table `songs` with columns “song_id” and “author_id”, where obviously some songs may have several authors and vice versa. Again, for simplicity, assume that these are integers, while the algorithm works for any (serializable) data type.
>
> For each query, I’d like to introduce the concept of the “subspace” on which it operates. If you choose the interpretation of a binary relation as a set of points in 2-dimensional space, then a subspace is what you get by fixing some of the dimensions.
> Some examples -- a list of queries with their subspaces:
> INSERT INTO songs (song_id, author_id) VALUES (1,2) :: [song_id=1, author_id=2]
> SELECT COUNT(*) FROM songs WHERE author_id = 7 :: [author_id=7]
> DELETE FROM songs WHERE song_id = 13 :: [song_id=13]
> DELETE FROM songs :: []
>
> Updates are a bit more complicated, and I’d prefer to think about them as a DELETE followed by an INSERT, for example:
> UPDATE songs SET author_id=13 WHERE author_id=7 AND song_id=3
> can be thought of as:
> DELETE FROM songs WHERE author_id=7 AND song_id=3 :: [song_id=3, author_id=7]
> INSERT INTO songs (song_id, author_id) VALUES (3,13) :: [song_id=3, author_id=13]
> I know this is not equivalent, and may cause some race conditions, but I believe it does not really matter in what follows.
>
> As you can see in our examples, a subspace can be 0-, 1-, or 2-dimensional, depending on the scope on which the query operates. If we are unsure about the actual scope, we need to upper-bound it for my algorithm to work. For example:
> SELECT COUNT(*) FROM songs WHERE song_id < 13 :: []
> Here we cannot say much about the rows relevant to this query, so we upper-bound it with the whole space. Please note that the subspace has nothing to do with the columns returned by the SELECT, but rather with the rows scanned during it.
>
> In general, the subspace is determined by all the equality constraints found in the query. This is such a trivial correspondence that you can (and should) let your backend library deduce it automatically from the query string. Or, if you use some object-oriented query builder, you should get it for free.
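> (For illustration, here is a minimal sketch in Python of such a deduction. It assumes the tracked columns are known up front and that constraints are simple “column = integer” equalities; the helper name subspace_of is mine, not from any particular framework.)
>
>     import re
>
>     TRACKED_COLUMNS = ("song_id", "author_id")  # the dimensions we track
>
>     def subspace_of(query):
>         """Return one entry per tracked column: the value fixed by an
>         equality constraint in the query, or '*' if unconstrained."""
>         subspace = []
>         for col in TRACKED_COLUMNS:
>             m = re.search(r"\b%s\s*=\s*(\d+)" % col, query)
>             subspace.append(int(m.group(1)) if m else "*")
>         return tuple(subspace)
>
>     # subspace_of("SELECT COUNT(*) FROM songs WHERE author_id = 7") == ('*', 7)
>     # subspace_of("DELETE FROM songs WHERE author_id=7 AND song_id=3") == (3, 7)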
> For a 2-dimensional space, there are 4 kinds of subspaces:
> - the whole space, given by no constraints at all: []
> - a point, given by two constraints, for example [song_id=3, author_id=7]
> - a vertical or horizontal line, given by one constraint, for example [song_id=3] or [author_id=7]
>
> If we have two queries A and B, such that A is a write (DELETE or INSERT) and B is a read (SELECT), then a cached result for B should be invalidated after query A if the subspace of A intersects the subspace of B. (Actually this is a sufficient condition, not a necessary one.)
> For example, the query
> DELETE FROM songs WHERE author_id=7 AND song_id=3 :: [song_id=3, author_id=7]
> affects the results of
> SELECT COUNT(*) FROM songs WHERE author_id = 7 :: [author_id=7]
> and you can tell that just by comparing [author_id=7] with [song_id=3, author_id=7], without looking at the queries. Of course this is a safe approximation: sometimes cache invalidation is not necessary, but performing it ensures the correctness of the algorithm.
>
> Now, notice that there may be infinitely many queries which need to be invalidated, as there are infinitely many subspaces that intersect a given one. For example, a query with subspace [author_id=7] must invalidate all queries with subspace [song_id=1], as well as [song_id=2], etc. Moreover, there are infinitely many possible queries with the same subspace. Frameworks that try to keep track of all intersecting queries are doomed either to store everything in a large, permanent storage (which kind of defeats the purpose of an in-memory cache to me), or to eventually forget about a query which needs to be invalidated (which is an error to me).
>
> The solution is much simpler though, and does not involve operating on infinitely many cache keys... only exponentially many. But don’t worry, it is exponential in the number of dimensions, not in the size of the dataset. For example, for a binary relation we need to invalidate 2^2 = 4 keys. If you have a table with many columns, please don’t be alarmed either -- you just need to choose a few columns on which you’d like to focus, and crop the subspace to those few columns -- you will invalidate things more often than necessary, but you will not get hit by the exponential blowup. I’ve never experienced a case with more than 3 relevant columns, and 2^3 = 8 is not a big number.
>
> We need some shorter notation for subspaces. Let us write them as 2-dimensional vectors (similar to rows in the table). Instead of [song_id=3, author_id=7], let us write (3,7), and for [song_id=3] we will use (3,*). The whole space is (*,*), and [author_id=4] is just (*,4).
>
> Now the tricky part. Let us introduce a new wildcard symbol “?”, which will mean something along the lines of “all subspaces which have a number here”. For example, (?,3) is a shorthand for the infinite set of subspaces {(0,3),(1,3),(2,3),...}. There is no particular query that has subspace (?,3). This artificial notation corresponds to all queries which have the second dimension fixed.
>
> I assume that your global cache (such as memcache) supports an atomic increment operation, which is very convenient for the task of maintaining a “revision number”, which we will associate with each subspace (artificial or not). When I say “with each”, I don’t mean buying an infinitely large computer -- we will just allocate memory for the subspaces actually used in production, and those unused (or rarely used, if you have an LRU garbage collector) will simply take no memory at all.
> A short note about evictions: if a get or increment results in a miss, simply set the revision to the current timestamp.
>
> I also assume that you know the idea of generational caching, that is, using a key name with an appended revision to store cached data (such as “get_songs_count.123”). This way, a single operation of incrementing a revision can invalidate all data stored at key names which depended on it. This seems like a waste of memory, but an LRU garbage collector can take care of the leftovers. The nice thing about it is that while the single integer (the revision number) must be stored in the global cache, the potentially large query result can be stored in the local cache, from where it can be quickly retrieved without any bottlenecks of the global cache. This is actually the only way (except for using silly small TTLs) of invalidating data stored in a local cache that I know of and use.
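> (A minimal sketch of these revision counters, assuming a memcached-style client that exposes get_multi, add, and incr, where a miss yields None; the helper names are mine.)
>
>     import time
>
>     def revision_key(subspace):
>         # e.g. ('*', 3) -> "rev.*.3" -- one counter per subspace, real or artificial
>         return "rev." + ".".join(str(x) for x in subspace)
>
>     def get_revisions(cache, subspaces):
>         """Bulk-fetch revisions; on a miss, initialize the revision to the
>         current timestamp, which is fresher than anything cached so far."""
>         keys = [revision_key(s) for s in subspaces]
>         found = cache.get_multi(keys)
>         for key in keys:
>             if key not in found:
>                 found[key] = int(time.time())
>                 cache.add(key, found[key])  # best effort; races are harmless here
>         return [found[k] for k in keys]
>
>     def bump_revision(cache, subspace):
>         key = revision_key(subspace)
>         if cache.incr(key) is None:  # miss: initialize to current timestamp
>             cache.add(key, int(time.time()))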
> Now the algorithm. Or actually two algorithms, one for read-only queries and one for writes.
>
> When you perform a read query A, first determine its subspace. I assume you have d columns, so the subspace can be expressed as a d-ary vector of numbers and stars, as explained above. For example, the query “SELECT song_id FROM songs WHERE author_id = 3” corresponds to the vector (*,3).
> Then compute all possible variations of this d-ary vector (at most 2^d of them), by using a single substitution rule: you can put “?” in place of a number. For example, if your original vector was (*,3), this results in two vectors: (*,?) and the original. If, on the other hand, your query had subspace (7,3), then you would end up with four variations: (7,3),(?,3),(7,?),(?,?).
> Then, using a bulk (multi)get, fetch from the global cache the revisions of those (at most 2^d) subspaces.
> Concatenate the revisions into one big revision (for example 123.3.114.14). Then fetch from the local cache the cached result of the query, which should reside at the key which is a concatenation of the query string (or a hash of it, if you prefer) and the big concatenated revision.
> If there is a miss in the local cache, try the global cache.
> If there is a miss in the global cache, query the database.
> Then add the missing keys to the caches as needed.
>
> When you perform a write query B, determine its subspace.
> Then compute exactly 2^d possible variations of it, by using two substitution rules:
> a) you can put “?” in place of “*”
> b) you can put “*” in place of any number
> For example, for a subspace (*,3) you’ll end up with four variants: (*,3),(?,3),(*,*),(?,*).
> Then perform the query B.
> Finally, increment the revisions of all these variants.
>
> As you can see, updates are a bit slower than reads (which in my opinion is usual, and nothing to be worried about). Some cache implementations allow bulk increments in one round-trip, in which case this might actually be faster than the read scenario, as you do not contact the local cache at all.
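> (Putting the two procedures together -- a sketch that reuses subspace_of, get_revisions, and bump_revision from above, and assumes cache objects with get/set and a database handle with execute; all of these names are mine.)
>
>     from itertools import product
>
>     def read_variants(subspace):
>         """Variants a read must check: each number may also become '?'."""
>         choices = [(x,) if x == "*" else (x, "?") for x in subspace]
>         return list(product(*choices))
>
>     def write_variants(subspace):
>         """Exactly 2^d variants a write must bump: '*' may become '?',
>         and any number may become '*'."""
>         choices = [("*", "?") if x == "*" else (x, "*") for x in subspace]
>         return list(product(*choices))
>
>     def cached_select(local_cache, global_cache, db, query):
>         subspace = subspace_of(query)
>         revs = get_revisions(global_cache, read_variants(subspace))
>         # key = query string + concatenated revisions, e.g. "....123.3.114.14"
>         key = query + "." + ".".join(str(r) for r in revs)
>         result = local_cache.get(key)
>         if result is not None:
>             return result
>         result = global_cache.get(key)
>         if result is None:
>             result = db.execute(query)  # miss everywhere: ask the database
>             global_cache.set(key, result)
>         local_cache.set(key, result)
>         return result
>
>     def perform_write(global_cache, db, query):
>         subspace = subspace_of(query)
>         db.execute(query)
>         for variant in write_variants(subspace):  # bump all 2^d revisions
>             bump_revision(global_cache, variant)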
> Now, some explanation of why it works.
> Imagine a bipartite graph which on the left-hand side has one vertex per possible subspace of a write query, and on the right side has vertices corresponding to subspaces of read queries. Actually both sets are equal, but we will focus on the edges.
> An edge goes from left to right if a query on the left side affects the results of a query on the right side. As said before, both sets are infinite, but that’s not the problem. There are infinitely many edges, but that’s also not bad. What’s bad is that there are nodes on the left side with infinite degree, which means we would need to invalidate infinitely many queries. What the above tricky algorithm does is add a third layer to the graph, in the middle between the two, such that the transitive closure of the resulting graph is still the same (in other words: you can still get, using two edges, anywhere you could get by one edge in the original graph), yet each node on the left and each node on the right has finite (actually constant) degree. This middle layer corresponds to the artificial subspaces with “?” marks, and serves as a connecting hub for all the mess. Now, when a query on the left executes, it needs to inform only its (small number of) neighbours about the change, moving the burden of reading this information to the right. That is, a query on the right side needs to check if there is a message in the “inbox” in the middle layer. So you can think about it as a cooperation where the left query makes one step forward, and the right query makes one step back, to meet at the central place and pass the important information about the invalidation of the cache.
>
> Another point of view is that two subspaces (v[1],...,v[d]) and (u[1],...,u[d]) intersect each other if and only if they agree on all positions without stars. Or in other words: there is no index k such that v[k]<>u[k] and v[k]<>* and u[k]<>*.
> So, given a vector (v[1],...,v[d]), you can predict that any intersecting subspace will be described by a vector (u[1],...,u[d]) of some particular form. That is, it can be obtained from (v[1],...,v[d]) by using some simple substitution rules:
> a) you can change any number to a star
> b) you can change any star to any number
> c) you cannot use both of the above rules for the same index k
> The problematic rule b) allows infinitely many possibilities. To overcome this we introduced the question mark symbol, which represents “any number” without stating any particular number. A path of operations leading from (v[1],...,v[d]) to (u[1],...,u[d]) is then explored from the other end, by trying to substitute numbers in (u[1],...,u[d]) with question marks. If we can reach the same term from both ends, then we know that there is a path from one end to the other, even though there is an infinite maze in the middle.
>
> As pointed out before, you can use this technique for a higher number of columns, or keep the focus on just a few of them at the expense of premature invalidation. You can easily use this for a topology without local caches, by storing everything in the global cache. You can adapt the technique to columns storing strings, or nulls, or whatever. You can use it for noSQL databases if you need to. But this is not the end of possible improvements.
>
> If you know in advance what kind of write queries you will perform, that is -- if you know what subspaces they will have -- you can “trim” the graph to a smaller one, which will result in faster reads and writes, as you will not have to fetch that many (is 4 many?) keys.
>
> For example, if your modifications are just single-row operations (INSERT one row, DELETE one row), then your modifying queries’ subspaces never have stars in them. You can immediately see from the substitution rules that you will never ever increment revisions of artificial subspaces (those with question marks), as “?” can only be made from “*”. If so, then it is not necessary to fetch them, or even think about them. You can safely ignore them.
> Now, read queries only depend on the revisions of vectors which differ from the original vector by replacing something with question marks. Since such revisions are never incremented in this setting, you only need to fetch the revision of the original vector, which obviously has no question marks. This allows for a quite fast and simple implementation, which fetches only a single key from the global cache (see the sketch below). Actually, my original framework forbade any modifications other than row-by-row just to get these extra-fast reads. I believe that in many real-world scenarios you never INSERT more than a single row at once. I’ve seen cases where you had an urge to DELETE many rows at once (for example when cleaning up dependent records), and you may feel bad about the idea of doing it row-by-row. Think about the amortized cost though: each of the rows you delete had to be added at some point in time, and if you already feel comfortable with the idea of adding rows one-by-one, then you have already paid enough credit to not worry about the deletes.
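> (A sketch of that trimmed fast path, reusing the helpers above; it is valid only under the stated assumption that every write’s subspace is star-free, so the “?” revisions are never bumped.)
>
>     def fast_cached_select(local_cache, global_cache, db, query):
>         """Single-row-writes setting: a read needs only the revision of its
>         own subspace -- one global-cache key instead of up to 2^d."""
>         subspace = subspace_of(query)
>         (rev,) = get_revisions(global_cache, [subspace])  # one key, one round-trip
>         key = query + "." + str(rev)
>         result = local_cache.get(key)
>         if result is not None:
>             return result
>         result = global_cache.get(key)
>         if result is None:
>             result = db.execute(query)
>             global_cache.set(key, result)
>         local_cache.set(key, result)
>         return result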
> That’s it, folks. Now, which frameworks implement this idea, in which paper can I read about it, and by how many years did I oversleep?