This post was just AWESOME!

Many thanks for the time spent and so much detail.

Congratulations, truly excellent work.

Thanks

Art
On Apr 28, 2012 8:14 AM, "Jakub Łopuszański" <[email protected]> wrote:

> There are only two hard things in Computer Science: cache invalidation
> and naming things.
> -- Phil Karlton
>
> Hi, I’d like to share with you an algorithm for cache invalidation that
> I’ve come up with and successfully implemented in a real-world
> application. I am a software architect at nk.pl, a Polish social network
> successfully competing with FB, and a founder and backend architect of
> Vanisoft, a startup in which I had more freedom to experiment with a
> different architecture.
>
> Perhaps my idea is not original, but I could not find a description of
> a similar concept anywhere, so even if this is reinventing the wheel,
> maybe it will help propagate the knowledge and save somebody else’s time.
> When I was designing the backend for Vanisoft’s project ZnaszMnie
> (currently over 400k players), all I could find on the internet about
> ways to cache database query results was either wrong (such as no cache
> invalidation at all) or ineffective (trying to keep a list of all cached
> results and invalidating them one by one).
>
> I assume you have more than one front-end machine, some databases, and a
> cache reachable from all front ends. Perhaps you also have a local cache
> on each front-end machine, which is really fast but inaccessible from
> other machines (and thus subject to cache invalidation problems). If so,
> then my approach can correctly provide a cached answer to a query using
> one round-trip to the global cache and one to the local cache (the cost
> of the latter is usually negligible), and it never serves stale data (and
> that last part is the most important for me).
>
> I assume that even if your database is not SQL, you can think of your
> queries as expressible in SQL, using SELECT, INSERT, UPDATE, or DELETE.
> This is needed only to understand the presentation of the algorithm,
> which can easily be adapted to many other cases.
>
> To keep things simple, let us consider a binary relation (a table with
> two columns, or a set of points in 2-dimensional space, if you prefer).
> The algorithm works for any number of dimensions, but it is easier to
> grasp in the 2-D case.
>
> So let’s think about a table `songs` with columns “song_id” and
> “author_id”, where obviously some songs may have several authors and
> vice versa. Again, for simplicity, assume that these are integers; the
> algorithm works for any (serializable) data type.
>
> For each query, I’d like to introduce the concept of the “subspace” on
> which it operates. If you choose the interpretation of a binary relation
> as a set of points in 2-dimensional space, then a subspace is what you
> get by fixing some of the dimensions.
> Some examples -- a list of queries with their subspaces:
> INSERT INTO songs (song_id, author_id) VALUES (1,2)  ::  [song_id=1,
> author_id=2]
> SELECT COUNT(*) FROM songs WHERE author_id = 7 ::  [author_id=7]
> DELETE FROM songs WHERE song_id = 13 :: [song_id=13]
> DELETE FROM songs :: []
>
> Updates are a bit more complicated, and I’d prefer to think of them as
> a DELETE followed by an INSERT, for example:
> UPDATE songs SET author_id=13 WHERE author_id=7 AND song_id=3
> can be thought of as:
> DELETE FROM songs WHERE author_id=7 AND song_id=3 :: [song_id=3,
> author_id=7]
> INSERT INTO songs (song_id, author_id) VALUES (3,13)   :: [song_id=3,
> author_id=13]
> I know this is not equivalent, and may cause some race conditions, but I
> believe it does not really matter in what follows.
>
> As you can see in our example, a subspace can be 0-, 1-, or
> 2-dimensional, depending on the scope on which the query operates. If we
> are unsure about the actual scope, we need to upper-bound it for my
> algorithm to work. For example:
> SELECT COUNT(*) FROM songs WHERE song_id < 13 :: []
> Here we cannot say much about which rows matter for this query, so we
> upper-bound it with the whole space. Please note that the subspace has
> nothing to do with the columns returned by the SELECT, but rather with
> the rows scanned during it.
>
> In general, the subspace is determined by all the equality constraints
> found in the query. This is such a trivial correspondence that you can
> (and should) let your backend library deduce it automatically from the
> query string. And if you use some object-oriented query builder, you get
> it for free.
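>
> A minimal sketch of that deduction in Python (the regex and the COLUMNS
> tuple are my illustration, not a real parser -- a query builder would
> hand you the constraints directly):
>
> import re
>
> COLUMNS = ("song_id", "author_id")  # the dimensions we track
>
> def subspace_of(query):
>     # Collect "column = number" equality constraints; a column that is
>     # not fixed by the query (or is constrained by <, >, IN, ...) stays
>     # as "*", which safely upper-bounds the subspace. INSERTs would need
>     # their own (equally trivial) handling of the VALUES list.
>     found = dict(re.findall(r"(\w+)\s*=\s*(\d+)", query))
>     return tuple(int(found[c]) if c in found else "*" for c in COLUMNS)
>
> assert subspace_of("SELECT COUNT(*) FROM songs WHERE author_id = 7") == ("*", 7)
> assert subspace_of("DELETE FROM songs WHERE song_id = 13") == (13, "*")
> assert subspace_of("SELECT COUNT(*) FROM songs WHERE song_id < 13") == ("*", "*")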
>
> For a 2-dimensional space, there are 4 kinds of subspaces:
> - the whole space, given by no constraints at all: []
> - a point, given by two constraints, for example [song_id=3, author_id=7]
> - a vertical or horizontal line, given by one constraint, for example
> [song_id=3] or [author_id=7]
>
> If we have two queries A and B, such that A is a write (DELETE or
> INSERT) and B is a read (SELECT), then a cached result for B should be
> invalidated after query A if the subspace of A intersects the subspace of
> B. (Actually, this is a sufficient condition, not a necessary one.)
> For example, the query
> DELETE FROM songs WHERE author_id=7 AND song_id=3 :: [song_id=3,
> author_id=7]
> affects results of
> SELECT COUNT(*) FROM songs WHERE author_id = 7 ::  [author_id=7]
> and you can tell just by comparing [author_id=7] with [song_id=3,
> author_id=7], without looking at the queries. Of course this is a safe
> approximation: sometimes cache invalidation is not necessary, but
> performing it ensures the correctness of the algorithm.
>
> Now, notice that there may be infinitely many queries which need to be
> invalidated, as there are many subspaces that intersect a given one. For
> example, a query with subspace [author_id=7] must invalidate all queries
> with subspace [song_id=1], as well as [song_id=2], etc. Moreover, there
> are infinitely many possible queries with the same subspace. Frameworks
> that try to keep track of all intersecting queries are doomed either to
> store everything in a large, permanent storage (which kind of defeats the
> purpose of an in-memory cache, to me) or to eventually forget about a
> query which needs to be invalidated (which is an error, to me).
>
> The solution is much simpler, though, and does not involve operating on
> infinitely many cache keys... only exponentially many. But don’t worry:
> it is exponential in the number of dimensions, not the size of the
> dataset. For example, for a binary relation we need to invalidate 2^2 = 4
> keys. If you have a table with many columns, don’t be alarmed either --
> you just need to choose a few columns on which you’d like to focus, and
> crop the subspace to those few columns -- you will invalidate things more
> often than necessary, but you will not get hit by the exponential
> blow-up. I’ve never encountered a case with more than 3 relevant columns,
> and 2^3 = 8 is not a big number.
>
> We need some shorter notation for subspaces. Let us write them as
> 2-dimensional vectors (similar to rows in the table). Instead of
> [song_id=3, author_id=7], let us write (3,7), and for [song_id=3] we will
> use (3,*). The whole space is (*,*), and [author_id=4] is just (*,4).
>
> Now the tricky part. Let us introduce a new wildcard symbol “?”, which
> will mean something along the lines of “all subspaces which have a number
> here”. For example, (?,3) is shorthand for the infinite set of subspaces
> {(0,3),(1,3),(2,3),...}. There is no particular query that has subspace
> (?,3). This artificial notation corresponds to all queries which have the
> second dimension fixed.
>
> I assume that your global cache (such as memcache) supports an atomic
> increment operation, which is very convenient for the task of maintaining
> a “revision number”, which we will associate with each subspace
> (artificial or not). When I say “with each”, I don’t mean buying an
> infinitely large computer -- we will just allocate memory for the
> subspaces actually used in production, and those unused (or rarely used,
> if you have an LRU garbage collector) will simply take no memory at all.
> A short note about evictions: if a get or increment results in a miss,
> simply set the revision to the current timestamp.
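>
> A sketch of the revision bookkeeping, assuming a python-memcached-style
> client (incr returns None on a miss; add is an atomic create-if-absent):
>
> import time
> import memcache
>
> mc = memcache.Client(["127.0.0.1:11211"])
>
> def revision_key(subspace):
>     # e.g. ("*", 7) -> "rev:*:7"; the "rev:" prefix is just my convention
>     return "rev:" + ":".join(str(v) for v in subspace)
>
> def get_revisions(subspaces):
>     # One bulk round-trip; an evicted revision restarts at the current
>     # timestamp, so it can never return to a value seen before.
>     keys = [revision_key(s) for s in subspaces]
>     found = mc.get_multi(keys)
>     now = int(time.time())
>     for key in (k for k in keys if k not in found):
>         mc.add(key, now)  # atomic: only the first writer wins the race
>     return [found.get(key, now) for key in keys]
>
> def bump_revision(subspace):
>     if mc.incr(revision_key(subspace)) is None:  # miss: key was evicted
>         mc.add(revision_key(subspace), int(time.time()))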
>
> I also assume that you know the idea of generational caching, that is,
> using a key name with an appended revision to store cached data (such as
> “get_songs_count.123”). This way, the single operation of incrementing a
> revision can invalidate all data stored at key names which depended on
> it. This seems like a waste of memory, but an LRU garbage collector can
> take care of the leftovers. The nice thing about it is that while the
> single integer (the revision number) must be stored in the global cache,
> the potentially large query result can be stored in the local cache, from
> where it can be quickly retrieved without any bottlenecks of the global
> cache. This is actually the only way (except for using silly small TTLs)
> of invalidating data stored in a local cache that I know of and use.
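>
> In code, the generational key is just a concatenation; nothing else is
> needed, because a bumped revision makes the old key unreachable:
>
> def cache_key(query, revisions):
>     # e.g. "SELECT COUNT(*) FROM songs WHERE author_id = 7.123.3.114.14"
>     # (hash the query part if your cache dislikes long or spacey keys)
>     return query + "." + ".".join(str(r) for r in revisions)
>
> After any relevant revision is incremented, subsequent reads compute a
> different key, miss, and recompute; the stale entry is simply left for
> the LRU collector to evict.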
>
> Now the algorithm. Or actually two algorithms: one for read-only queries
> and one for writes.
>
> When you perform a read query A, first determine its subspace. I assume
> you have d columns, so the subspace can be expressed as a d-ary vector of
> numbers and stars, as explained above. For example, the query “SELECT
> song_id FROM songs WHERE author_id = 3” corresponds to the vector (*,3).
> Then compute all possible variations of this d-ary vector, by using a
> single substitution rule: you may put “?” in place of a number.
> For example, if your original vector was (*,3), this results in two
> vectors: (*,?) and the original. If, on the other hand, your query had
> subspace (7,3), then you would end up with four variations:
> (7,3), (?,3), (7,?), (?,?).
> Then, using a bulk (multi)get, fetch from the global cache the revisions
> of those (at most 2^d) subspaces.
> Concatenate the revisions into one big revision (for example
> 123.3.114.14).
> Then fetch from the local cache the cached result of the query, which
> should reside at the key which is a concatenation of the query string (or
> a hash of it, if you prefer) and the big concatenated revision.
> If there is a miss in the local cache, try the global cache.
> If there is a miss in the global cache, query the database.
> Then add missing keys to the caches as needed.
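>
> Putting the read path together (a sketch reusing subspace_of,
> get_revisions, and cache_key from the sketches above; local_cache is a
> plain per-process dict and database_execute stands for your DB layer):
>
> from itertools import product
>
> local_cache = {}
>
> def read_variants(subspace):
>     # the single substitution rule: a number may become "?"; at most 2^d
>     choices = [(v,) if v == "*" else (v, "?") for v in subspace]
>     return list(product(*choices))
>
> def cached_read(query):
>     subspace = subspace_of(query)
>     revisions = get_revisions(read_variants(subspace))  # one bulk multiget
>     key = cache_key(query, revisions)
>     result = local_cache.get(key)
>     if result is None:
>         result = mc.get(key)             # global cache
>     if result is None:
>         result = database_execute(query)
>         mc.set(key, result)
>     local_cache[key] = result            # both caches are warm now
>     return result
>
> assert read_variants(("*", 3)) == [("*", 3), ("*", "?")]
> assert len(read_variants((7, 3))) == 4  # (7,3), (7,?), (?,3), (?,?)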
>
> When you perform a write query B, determine its subspace.
> Then compute exactly 2^d variations of it, by using two substitution
> rules:
> a) you may put “?” in place of “*”
> b) you may put “*” in place of any number
> For example, for the subspace (*,3) you’ll end up with four variants:
> (*,3), (?,3), (*,*), (?,*).
> Then perform the query B.
> Finally, increment the revisions of all these variants.
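>
> And the write path (again reusing the helpers from the sketches above;
> note that the two rules never produce a number where there was none):
>
> def write_variants(subspace):
>     # rule a): "*" may become "?"; rule b): a number may become "*"
>     choices = [("*", "?") if v == "*" else (v, "*") for v in subspace]
>     return list(product(*choices))  # always exactly 2^d of them
>
> def write(query):
>     database_execute(query)                 # perform the query B
>     for variant in write_variants(subspace_of(query)):
>         bump_revision(variant)              # ideally one bulk round-trip
>
> assert set(write_variants(("*", 3))) == {("*", 3), ("?", 3), ("*", "*"), ("?", "*")}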
>
> As you can see, writes are a bit slower than reads (which in my opinion
> is usual, and nothing to be worried about). Some cache implementations
> allow bulk increments in one round-trip, in which case a write might
> actually be faster than the read scenario, as you do not contact the
> local cache at all.
>
> Now, some explanation of why it works.
> Imagine a bipartite graph which has, on the left-hand side, one vertex
> per possible subspace of a write query, and on the right-hand side
> vertices corresponding to the subspaces of read queries. Actually both
> sets are equal, but we will focus on the edges.
> An edge goes from left to right if the query on the left side affects the
> results of the query on the right side. As said before, both sets are
> infinite, but that’s not the problem. There are infinitely many edges,
> but that’s not bad either. What’s bad is that there are nodes on the left
> side with infinite degree, which means we would need to invalidate
> infinitely many queries. What the above tricky algorithm does is add a
> third layer to the graph, in the middle between the two, such that the
> transitive closure of the resulting graph stays the same (in other words:
> you can still get, using two edges, anywhere you could get by one edge in
> the original graph), yet each node on the left and each node on the right
> has a finite (actually constant) degree. This middle layer corresponds to
> the artificial subspaces with “?” marks, and serves as a connecting hub
> for all the mess. Now, when a query on the left executes, it needs to
> inform only its (small number of) neighbours about the change, moving the
> burden of reading this information to the right. That is, a query on the
> right side needs to check whether there is a message in its “inbox” in
> the middle layer. So you can think of it as a cooperation where the left
> query makes one step forward and the right query makes one step back, so
> they meet at a central place and pass on the important information about
> the invalidation of the cache.
>
> Another point of view is that two subspaces (v[1],...,v[d]) and
> (u[1],...,u[d]) intersect each other if and only if they agree on all
> positions without stars. Or in other words: there is no index k such that
> v[k]<>u[k] and v[k]<>* and u[k]<>*.
> So, given a vector (v[1],...,v[d]), you can predict that any intersecting
> subspace will be described by a vector (u[1],...,u[d]) of a particular
> form. That is, it can be reached from (v[1],...,v[d]) by using some
> simple substitution rules:
> a) you can change any number to a star
> b) you can change any star to any number
> c) you cannot use both of the above rules for the same index k
> The problematic rule b) allows infinitely many possibilities.
> To overcome this, we introduced the question mark symbol, which
> represents “any number” without naming any particular number. The path of
> substitutions leading from (v[1],...,v[d]) to (u[1],...,u[d]) is then
> explored from the other end, by trying to substitute numbers in
> (u[1],...,u[d]) with question marks. If we can reach the same term from
> both ends, then we know that there is a path from one end to the other,
> even though there is an infinite maze in the middle.
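>
> This “both ends” argument can be checked mechanically; a small sketch
> (reusing read_variants and write_variants from the sketches above):
>
> def intersects(v, u):
>     # subspaces intersect iff they agree wherever neither has a star
>     return all(a == b or a == "*" or b == "*" for a, b in zip(v, u))
>
> # A write to v bumps a revision that a read on u fetches exactly when
> # the two subspaces intersect: the two paths meet at a common term.
> for v in [(3, 7), (3, "*"), ("*", "*")]:
>     for u in [(3, 7), ("*", 7), (4, "*")]:
>         meet = set(write_variants(v)) & set(read_variants(u))
>         assert bool(meet) == intersects(v, u)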
>
> As pointed out before, you can use this technique for a higher number of
> columns, or keep the focus on just a few of them at the expense of
> premature invalidation. You can easily use it in a topology without local
> caches, by storing everything in the global cache. You can adapt the
> technique to columns storing strings, or nulls, or whatever. You can use
> it for noSQL databases if you need to. But this is not the end of
> possible improvements.
>
> If you know in advance what kinds of write queries you will perform,
> that is -- if you know what subspaces they will have -- you can “trim”
> the graph to a smaller one, which will result in faster reads and writes,
> as you will not have to fetch that many (is 4 many?) keys.
>
> For example, if your modifications are just single-row operations
> (INSERT one row, DELETE one row), then the subspaces of your modifying
> queries never have stars in them. You can immediately see from the
> substitution rules that you will never ever increment the revisions of
> artificial subspaces (those with question marks), as “?” can be made only
> from “*”. If so, then it is not necessary to fetch them, or even think
> about them. You can safely ignore them.
> Now, read queries only depend on the revisions of vectors which differ
> from the original vector by replacing something with question marks.
> Since such revisions are never incremented in this setting, you only need
> to fetch the revision of the original vector, which obviously has no
> question marks. This allows for a quite fast and simple implementation,
> which fetches only a single key from the global cache. Actually, my
> original framework forbade any modifications other than row-by-row just
> to get these extra-fast reads. I believe that in many real-world
> scenarios you never INSERT more than a single row at once. I’ve seen
> cases where you had an urge to DELETE many rows at once (for example when
> cleaning up dependent records), and you may feel bad about the idea of
> doing it row by row. Think about the amortized cost, though: each of the
> rows you delete had to be added at some point in time, and if you already
> feel comfortable with the idea of adding rows one by one, then you have
> already paid enough credit not to worry about the deletes.
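>
> A sketch of that trimmed read path, under the stated single-row-writes
> assumption (again reusing the helpers from the earlier sketches):
>
> def cached_read_point_writes(query):
>     # when every write fixes all d columns, no "?" revision is ever
>     # bumped, so only the original vector's revision matters
>     subspace = subspace_of(query)
>     (revision,) = get_revisions([subspace])  # a single key, one round-trip
>     key = cache_key(query, [revision])
>     result = local_cache.get(key)
>     if result is None:
>         result = mc.get(key)
>     if result is None:
>         result = database_execute(query)
>         mc.set(key, result)
>     local_cache[key] = result
>     return result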
>
> That’s it, folks. Now, which frameworks implement this idea, in which
> paper can I read about it, and by how many years did I oversleep?
>
