So I consulted briefly on a project that was using a semistructured
data store for everything.  Everything was a graph, and they were
having a hard time getting it all to run at a reasonable speed,
because every time they touched a node, they issued a SQL query to
pull all the edges related to it out of the database.  Since typical
pages showed a few important properties from a large number of objects
(think: first and last names of all of your friends), every page
actually copied a substantial fraction of the data out of the
database.

On the other hand, it seemed that if you composed a SQL query (some
horrendous multi-way self-join) that fetched only the needed data from
the database, MySQL's query optimizer would do a good enough job with
it that the query would run in a small fraction of a second.
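
Concretely (and purely as illustration, since I'm making up the
schema), if the graph were stored as a single edges(subject,
predicate, object) table, the hand-written query for "first and last
names of all of X's friends" might look something like this:

    # Hypothetical schema, just for illustration:
    #   edges(subject, predicate, object)
    FRIEND_NAMES_SQL = """
        SELECT fn.object AS firstname, ln.object AS lastname
          FROM edges f
          JOIN edges fn ON fn.subject = f.object
                       AND fn.predicate = 'firstname'
          JOIN edges ln ON ln.subject = f.object
                       AND ln.predicate = 'lastname'
         WHERE f.subject = ? AND f.predicate = 'friend'
    """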

I'd experienced this before on a database-backed web site project, and
the cause was that the fetch_friends() method (or get_all_rogue_aps()
or whatever) didn't have enough knowledge of the context it was being
called in to know what other information its caller would want out of
the database; but it had to fetch the results immediately so its
caller could use them.  Using MySQL merely as a triple store, as
Class::RDF does, just made the problem ten times worse.
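
The shape of that problem, sketched with the same made-up edges table
and made-up helper functions (not anybody's real code), is roughly
this:

    import sqlite3

    db = sqlite3.connect(':memory:')
    db.execute('CREATE TABLE edges (subject, predicate, object)')

    def fetch_friends(user):
        # one query for the friend edges...
        return [row[0] for row in db.execute(
            "SELECT object FROM edges"
            " WHERE subject = ? AND predicate = 'friend'", (user,))]

    def get_property(node, name):
        # ...and then one more query per property per friend
        row = db.execute("SELECT object FROM edges"
                         " WHERE subject = ? AND predicate = ?",
                         (node, name)).fetchone()
        return row[0] if row else ''

    def friends_page(user):
        # a page listing N friends costs 1 + 2N queries
        return '\n'.join('<li>%s %s</li>'
                         % (get_property(f, 'firstname'),
                            get_property(f, 'lastname'))
                         for f in fetch_friends(user))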

On my way to the grocery store, I was thinking about this, and about
Erlang's nested "IO lists" of characters for constructing output, and
about HTML templates, and my own implementation of part of the
relational algebra in Python with lazy compilation to SQL ("prototype
SchemeQL/Roe-like thing in Python",
http://lists.canonical.org/pipermail/kragen-hacks/2004-April/000394.html).

And it occurred to me that if you filled in your HTML template with
lazy queries ("for each 'friend' edge, spit out this chunk of HTML
which interpolates the 'firstname' and 'lastname' values from its
context") then you could actually wait until you'd constructed the
whole page and were running the template engine before you hit the
database.  You could run all kinds of imperative code to create these
lazy query chunks and paste them together, but without actually
hitting the database.
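
Here is a minimal sketch of what I mean, with made-up node types and
no real template engine; the point is that friends_chunk() below is
pure construction and never touches the database:

    class Field(object):
        # "interpolate this property of the node bound to `var`"
        def __init__(self, var, name):
            self.var, self.name = var, name

    class ForEachEdge(object):
        # "for each `predicate` edge out of `var`, bind the far node
        # to `child_var` and emit `body`"
        def __init__(self, var, predicate, child_var, body):
            self.var, self.predicate = var, predicate
            self.child_var, self.body = child_var, body

    def friends_chunk():
        # pure construction: no queries are issued here
        return ForEachEdge('user', 'friend', 'f',
                           ['<li>', Field('f', 'firstname'), ' ',
                            Field('f', 'lastname'), '</li>'])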

That means that you have all the information required to compile
everything down into an efficient SQL query (or two or three).
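
Continuing the sketch, a render step could gather up all the Field
nodes under a ForEachEdge and compile them into one join against the
same made-up edges table, so the whole loop costs one round trip
instead of one per friend per property (no nesting, no SQL escaping,
purely illustrative):

    def fields_of(chunk, var):
        # collect the names of Fields under `chunk` that refer to `var`
        if isinstance(chunk, Field) and chunk.var == var:
            return [chunk.name]
        if isinstance(chunk, list):
            return sum([fields_of(c, var) for c in chunk], [])
        return []

    def render(chunk, bindings, db):
        if isinstance(chunk, str):
            return chunk
        if isinstance(chunk, list):
            return ''.join(render(c, bindings, db) for c in chunk)
        if isinstance(chunk, Field):
            return bindings[chunk.var][chunk.name]
        if isinstance(chunk, ForEachEdge):
            names = sorted(set(fields_of(chunk.body, chunk.child_var)))
            joins = ' '.join(
                "JOIN edges t%d ON t%d.subject = e.object"
                " AND t%d.predicate = '%s'" % (i, i, i, name)
                for i, name in enumerate(names))
            sql = ("SELECT %s FROM edges e %s"
                   " WHERE e.subject = ? AND e.predicate = ?"
                   % (', '.join('t%d.object' % i
                                for i in range(len(names))), joins))
            out = []
            for row in db.execute(sql,
                                  (bindings[chunk.var], chunk.predicate)):
                row_bindings = dict(bindings)
                row_bindings[chunk.child_var] = dict(zip(names, row))
                out.append(render(chunk.body, row_bindings, db))
            return ''.join(out)

    # One query for the whole friends list:
    #   print render(friends_chunk(), {'user': 'some-user-id'}, db)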

You can't do this, of course, for information that you have to fetch
from the database in order to direct the flow of some imperative code,
but you can usually avoid doing that an unbounded number of times per
page.

If you merely want to display each person's last name or not,
according to whether they're your friend, your imperative code can
create a conditional operator node from the "friend" status, the last
name, and the empty string (or whatever you use instead of the last
name).  This will compile into a query that fetches both the friend
status and the last name for each person; the database will fetch all
the last names, even the ones you don't want to display, and generally
this will be pretty efficient.
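
In the sketch above, that's just one more node type; I'm modeling
friendship as a made-up is_friend property here only to keep the
example small:

    class Cond(object):
        # "emit `then` if `test` comes back truthy, else `otherwise`"
        def __init__(self, test, then, otherwise):
            self.test, self.then, self.otherwise = test, then, otherwise

    people_chunk = ForEachEdge('user', 'knows', 'p',
                               ['<li>', Field('p', 'firstname'), ' ',
                                Cond(Field('p', 'is_friend'),
                                     Field('p', 'lastname'), ''),
                                '</li>'])
    # fields_of() and render() would need Cond branches, but the
    # compiled query still fetches is_friend and lastname for everyone
    # in one round trip; render() then decides which last names to show.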

But for decisions like whether to do a particular update to the
database or to take some other action in the world, you must do
non-lazy database queries from your imperative code, which means they
can't be batched and optimized into a single query.  This usually
isn't the pain point, though.

This approach doesn't break even if your "templating" is doing
arbitrarily complex stuff internally, in whatever imperative or
functional style you like, as long as it doesn't need more data from
the database.

This might be a fun way to build a SnikiSniki-like system.
