(3800 words, terribly sorry)

Lots of web pages are computed dynamically from some set of resources.
For example, a blog's front page is computed dynamically from a
(potentially large) set of blog posts.  The display of an RSS aggregator,
such as the LiveJournal friends page, is computed dynamically from a
subscription list and a bunch of friends' blogs.  Many web pages use some
kind of templating system, so the page consists at least of "content"
and the template.

Performance problems in web page compositing
--------------------------------------------

These applications have a set of performance problems.  Retrieving the
resources they depend on can take time, use bandwidth, and cost other
people money; dynamically generating the web pages themselves uses CPU
time; and all of this often happens while a human being is impatiently
waiting for a response.

Dependency-directed software builds: make
-----------------------------------------

Normally people write software as a human-readable "source code file"
which gets "compiled" (by a slow program) into an "object file" containing
"machine code", which then gets "linked" with other object files (by a
fast program) into an "executable" which they can run.  When they make
changes to the source code, they often want to see the results of their
changes, and they have to rerun this whole process.  But typically, they
haven't changed all the source files, so they only have to recompile
some of the source files.

There's a widely used program called "make" that figures out which
object files are out of date and need to be recompiled from source.
We write a "Makefile" which gives the command to build each object
file, and the command to link the final executable, along with a list
of source files that go into that object file (or object files that go
into that executable.)  If any of these "dependencies" are newer than the
"target", then "make" reruns the command, which "rebuilds" the "target"
from the data in the "dependencies".
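
For instance, a minimal Makefile for a two-file C program might look
like this (the file names are invented for illustration; the indented
command lines must begin with a tab):

    hello: main.o greet.o
            cc -o hello main.o greet.o

    main.o: main.c greet.h
            cc -c main.c

    greet.o: greet.c greet.h
            cc -c greet.c

Touch greet.h and make rebuilds both object files and relinks; touch
only main.c and it recompiles main.o and relinks.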

Specifying these dependencies is error-prone, and if something fails to
get rebuilt when it should, it often results in hard-to-debug problems.

Interposition in make, with Vesta
---------------------------------

Vesta (http://www.vestasys.org/) is a free-software system for, among
other things, replacing make.  It uses "filesystem encapsulation" to make
sure it has dependencies correct; its approach to tracking dependencies
(http://www.vestasys.org/doc/sdl-ref/walkthrough.html#RunTool) is to
run the compiler (and linker and whatnot) in a chroot in which the only
filesystem available is one provided by Vesta itself.  This means that
it can detect every file the compiler reads, and they're all checked
into its version-tracking repository, so it can reliably tell whether
rerunning a build step would produce the same results as before ---
in which case, it can safely use the previous results.

(Vesta's replacement for make isn't compatible with Makefiles, using its
own syntax and even terminology, and it only works if you store your
files in Vesta's version-tracking repository.  I suspect these may be
reasons Vesta isn't more widely used.  Perhaps someone should create a
version of make that works the same way.)

Interposition in web-page compositing
-------------------------------------

It seems to me that you could use the Vesta strategy with a web-page
compositing system, integrated into a caching proxy HTTP server.
If all the inputs to your web-page generating script are fetched by HTTP
through this proxy, then the proxy can know what they are, and serve
cached output if they haven't changed.  It can measure the CPU cost of
running each script, and only cache output when the time saved is worth
the space used.  It could even run some scripts periodically (if they
take a while to run, and their results are likely to be used) in order
to ensure that their results are available promptly when needed.

You'd have to make sure that you didn't forget any inputs, though --- if you
used any inputs that weren't handled automatically (like the filesystem,
perhaps, or the system clock) you'd want to have a way to handle those
out-of-band.
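
Here's a minimal sketch, in Python, of the bookkeeping such a proxy
might do; fetch_origin() (returning a body and a validator such as an
ETag) and run_script() are invented stand-ins for the real pieces:

    import time

    cache = {}   # script -> (recorded dependencies, output, CPU cost)

    def run_cached(script, run_script, fetch_origin):
        entry = cache.get(script)
        if entry:
            deps, output, cost = entry
            # Revalidate: the cached output is good if no input changed.
            # (A real proxy would use conditional GETs here.)
            if all(fetch_origin(url)[1] == etag for url, etag in deps.items()):
                return output
        deps = {}
        def fetch(url):
            # Every input the script reads goes through here, so the
            # proxy knows the complete dependency set.
            body, etag = fetch_origin(url)
            deps[url] = etag
            return body
        start = time.process_time()
        output = run_script(script, fetch)
        cost = time.process_time() - start
        if cost > 0.01:   # only spend cache space when CPU savings are real
            cache[script] = (deps, output, cost)
        return output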

Common components
-----------------

A caching proxy caches HTTP results at the level of entire
resources (well, their representations).  If you write a
script that depends on eBay's "Toshiba Laptops, 500-666MHz" page
(http://computers.listings.ebay.com/Toshiba_500-666MHz_W0QQfromZR4QQsacatZ31561QQsocmdZListingItemList),
in order to inform you when a new Tecra 8100 is going on sale, you
might not want to rerun your script every time the banner ad on that
page changes.  Instead, you could split it into two parts: one that
executes some query on that page (maybe: find the <tr> elements
containing "8100" but no nested <tr>) and makes the results available
at some other URL,
and a heavier-weight script that uses the stripped-down page at this
other URL.  Then your heavier-weight script would only run when the part
of the page it cared about had changed.

This lightweight part-extractor component will be useful in many
applications, since often you only care about a part of the pages you
depend on, and it can make this fact visible to the execution environment.
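
A part-extractor like this can be very small.  A sketch in Python using
lxml (the XPath query is just an example; error handling omitted):

    import urllib.request
    from lxml import html

    def extract(url, xpath):
        # Fetch the page and return only the parts the query matches, so
        # downstream scripts can depend on this smaller resource instead.
        page = html.fromstring(urllib.request.urlopen(url).read())
        return "\n".join(html.tostring(part, encoding="unicode")
                         for part in page.xpath(xpath))

    # e.g., the table rows mentioning the model we care about:
    #   extract(category_url, '//tr[contains(., "8100")]')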

An analogous component is one that combines several resources into one,
perhaps so that you can extract parts from the combination more easily,
or perhaps so that you can display the combination itself as your
"portal" page.  By separating out this "compositing" or "templating"
function as a separate task, you allow the various pieces to be recomputed
(and possibly cached) independently.

A third common component, which might subsume the tasks of the first
two, is an XSLT transformer.

The Atom format, which is normally used for sequences of updates,
suggests several potentially interesting components that can live
within this framework: a "truncator" that drops all but the most
recent N events from a given feed; an "aggregator" that takes several
Atom feeds and interleaves their posts according to time; and an
"encapsulator" or "watcher" that takes an entire web page and
encapsulates it as a single Atom item in a one-item-long feed.
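
Sketches of the first two, in Python, treating a feed as a plain list
of entries with "updated" timestamps (a real version would parse and
emit Atom XML):

    import heapq

    def truncate(entries, n):
        # Keep only the N most recent entries of a feed.
        return sorted(entries, key=lambda e: e["updated"], reverse=True)[:n]

    def aggregate(*feeds):
        # Interleave several feeds' entries, newest first.
        sorted_feeds = [truncate(f, len(f)) for f in feeds]
        return list(heapq.merge(*sorted_feeds,
                                key=lambda e: e["updated"], reverse=True))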

Other features of this execution environment
--------------------------------------------

In addition to network access, input caching, output caching, CPU
measurement, and speculative pre-caching (this last mentioned only in
passing above), the following features would fit in nicely:
- special handling of Cache-control: max-age (and Age:, of course).  Scripts
  inside the system could use this to prevent the proxy from polling an origin
  server too often, and normally it would propagate from a request for a
  composited page to the requests used to fetch the components of that page.
- invalidation propagation.  Speculative pre-caching, if it produces a new
  representation for an existing resource, may invalidate other cached
  representations that were computed from that one.  (HTTP might not consider
  those representations "stale", but they no longer reflect the current state
  of the resources they depend on.)  The execution environment should
  propagate the cache invalidation, as sketched below after this list.
  In some cases, it may be wise to speculatively
  recompute those other resources as well.  (All of this might also occur while
  a request is being processed, but it would be wise to defer any speculative
  recomputation until the response is finished.)
- special HTTP error handling.  In many cases, it's better to get an
  out-of-date cached representation for a resource than an error
  message, and any caching proxy would be in a position to provide this
  benefit (although I don't know of any that do).  In other cases, you
  might want to propagate e.g. 500 and 404 errors through automatically,
  rather than having special-case code in your service that handles
  this.  (Perhaps that's better to put in a library rather than an
  execution framework, though.)
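
The invalidation propagation might look like the following sketch,
where derived_from is an invented reverse-dependency table maintained
by the environment:

    from collections import deque

    def invalidate(url, derived_from, cache, recompute_queue):
        # Walk forward along the dataflow edges, dropping every cached
        # representation that transitively depended on `url`.
        pending, seen = deque([url]), set()
        while pending:
            u = pending.popleft()
            for dependent in derived_from.get(u, ()):
                if dependent not in seen:
                    seen.add(dependent)
                    cache.pop(dependent, None)
                    recompute_queue.append(dependent)  # speculative, deferred
                    pending.append(dependent)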

The choice of which representations to cache cannot be solved locally, since
the benefit of caching something depends on the cost of computing it.
Something that makes sense to cache because it depends on some uncached and
expensive computation might cease to make sense to cache once that uncached
computation gets cached itself.  It turns out that this is related to
well-known problems in OLAP.  See J. Ullman, "Efficient Implementation of Data
Cubes via Materialized Views", http://dbpubs.stanford.edu/pub/1996-32.

Given limited resources, the optimal frequency to speculatively recompute
something (whether it's as simple as a cached copy of a web page, or
something much more CPU-intensive) does not increase monotonically as
the update rate of its dependencies increases.  At some point, it is no
longer worth it to re-fetch web pages more often in hope that they will be
up-to-date when they are needed.  See Arasu, Cho, Garcia-Molina, Paepcke,
Raghavan, "Searching the Web", http://dbpubs.stanford.edu/pub/2000-37.

Other systems complementary to this dependency-driven execution environment
---------------------------------------------------------------------------

Without a protocol for pushing cache invalidations across the network,
the above-described cache-invalidation mechanism only works inside a
single server.  Given a cache-invalidation protocol for HTTP resources
(for example, the one suggested in my zLab note, "Synchronizing REST
Resources With Asynchronous Notification: A Practical Proposal", which
presently resides at
http://wiki.commerce.net/wiki/Synchronizing_REST_Resources_With_Asynchronous_Notification:_A_Practical_Proposal),
it is possible to prevent any composited page from being out of date
except when the network is unreliable or under conditions of excessive
load.

Such a protocol could be added to existing web sites with a program we
at CommerceNet have been calling a "delta proxy": a reverse proxy that
sits in front of your web server and notices when resources'
representations change.  It is then in a position to provide various
kinds of change notifications, such as a Google "sitemap" file, RSS and
Atom feeds, or pushed cache invalidations.
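
The core of a delta proxy is just change detection.  A sketch, where
notify() is a stand-in for whatever notification channel you attach:

    import hashlib

    last_hash = {}   # url -> hash of the representation we last relayed

    def observe(url, body, notify):
        # Called by the reverse proxy on every response it relays.
        h = hashlib.sha256(body).hexdigest()
        if last_hash.get(url) != h:
            last_hash[url] = h
            notify(url)   # push an invalidation, append to a feed, etc.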

With Ajax, you can actually push the update notification all the way
to the web browser, updating information on somebody's screen in real
time.  (The current suite of Ajax tools doesn't support asynchronous
event delivery to the browser --- they don't hold any connection open
to the server --- but as we showed at KnowNow, it is possible, with
some difficulty, in all modern browsers.)

Many applications you might want to build would require some
historical information --- for example, if you have a web page that
shows the current temperature in Portland, Oregon, you might want to
graph it over time.

Given the previously-mentioned "watcher" component that encapsulates a
web page as an Atom feed, one additional component, which doesn't
quite fit into this framework, can enable such applications: a "feed
extender", which turns a shorter Atom feed
(such as a one-item feed) into a longer feed, with up to N items, by
remembering items it saw in the past and inserting "new" items at the
beginning of the feed.
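
A sketch of the feed extender, again treating entries as plain
dictionaries with "id" fields:

    remembered = {}   # feed URL -> list of entries, newest first

    def extend(feed_url, current_entries, n=100):
        # Prepend entries we haven't seen before, so a one-item feed
        # accumulates into an N-item history over time.
        old = remembered.get(feed_url, [])
        known = {e["id"] for e in old}
        new = [e for e in current_entries if e["id"] not in known]
        remembered[feed_url] = (new + old)[:n]
        return remembered[feed_url]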

There's also the question of how to avoid scaling problems in
publishing resources like this.  I think we can build a
decentralized network that gives you a few orders of magnitude
improvement here; I've written about this previously in
<http://wiki.commerce.net/wiki/Point_to_Point_Multicast_Atom_Network>.
If you supplement the system described there with cryptographic
signatures with serial numbers, with key hashes included in the URL, you
can avoid the necessity for anyone to hit the origin server, so there
need not be a publicly-accessible origin server any more.

Internal State as Cached Computations on I/O History
----------------------------------------------------

Several formalisms describe a program as a function from an input
history to an output action; my own "rumor-oriented programming" idea
<http://lists.canonical.org/pipermail/kragen-tol/2004-January/000749.html>
depends on this.  If you map your input history into URL space as REST
resources, then your program can be entirely stateless --- all its state
can live out in the web, with this caching proxy to save it from taking
a long time to do anything at all.

Of course, you don't want *any* change to the input history to
invalidate *all* your previous computations.  So you structure your
program in parts.  Suppose, for concreteness, you have a blogging
application whose input history consists of HTTP form POSTs to some URL.
You have one resource that lists all the past form POSTs as links, in
order, and one resource for each form POST.  Your blog-front-page
program looks at this form POST index and issues a request for each
group of ten form POSTs, containing their URLs, to a summarization
script, which returns the dates of the blog posts contained in those
form POSTs.  Then the front-page program decides which ten or fifteen
form POSTs are relevant to displaying the front page, and fetches them
directly.  Although the front-page program's cached results are
invalidated whenever a new POST is made, the summarization script's
results are not, and most of the computation (trawling through massive
amounts of text and deciding which posts are authorized and which
aren't, for instance) is in the summarization script.

In general you can turn a computation on your entire input history into
more-cacheable computations on subsets of it in this way.  It may be
difficult to fit the specifications for these computations into a URL,
though.
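
A sketch of the front-page program's shape; fetch() is the
dependency-recording fetch from the proxy sketch, and the index and
summarizer URLs are invented:

    from urllib.parse import quote

    def front_page(fetch):
        post_urls = fetch("/posts/index").splitlines()
        # Summarize in groups of ten, so a new POST invalidates only the
        # summary of the most recent group, not all of them.
        dated = []
        for i in range(0, len(post_urls), 10):
            group = ",".join(post_urls[i:i + 10])
            dates = fetch("/summarize?posts=" + quote(group)).splitlines()
            dated += zip(dates, post_urls[i:i + 10])
        recent = sorted(dated, reverse=True)[:15]
        return "\n".join(fetch(url) for date, url in recent)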

Some apps depend on the past representations of some other
URL-addressable resource instead of explicit user actions.  These apps
can be wrapped up into this framework with the Atom "watcher" and "feed
extender" components discussed above.

Sample app outline: eBay querier
--------------------------------

Suppose you want to follow the current market price of Toshiba Tecra
8100s on eBay over time.  eBay used to allow easy searching of
recently-completed auctions, which gave you an easy way to get the
current market price: look for successfully-completed auctions of recent
similar items and use the final selling price.  "Buy it now!" added a
confounding factor, since it allowed a bad seller to depress the
apparent market price, but you could ignore the "Buy it now!" hits.

But first they moved the "search completed items" option off the main
search page onto search-completed.ebay.com.  Now even that requires that
you log into your eBay account, so they can track who's searching for
what.

My approach now is to do a current search, find the non-buy-it-now
auctions ending soonest, bookmark the links to those auctions, and
follow those links after the auctions are scheduled to end.  This is a
little laborious, and it would be nice to automate.

So you establish a "watcher" and a "feed extender" with a polling
interval of, say, an hour, on some relevant eBay category page.  Then
you write a script which looks at the "extended feed" and, given a
search string, finds items matching the string that appear in a recent
version of the page but not in the current version (that is, auctions
that have ended).  Then it
fetches the items' individual auction pages and extracts the number of
bids and the closing price, and returns a table of these with the item
descriptions and links.
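
The comparison step might look like this, treating each entry of the
extended feed as one snapshot of the category page, represented (after
extraction) as a set of (item URL, description) pairs:

    def ended_matching(extended_feed, search):
        # Items that matched the search in an older snapshot but are
        # gone from the newest one: auctions that have ended.
        current = extended_feed[0]["items"]
        ended = set()
        for snapshot in extended_feed[1:]:
            for url, desc in snapshot["items"]:
                if search in desc and (url, desc) not in current:
                    ended.add((url, desc))
        return ended

The script would then fetch each such item's auction page and extract
the bid count and closing price.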

Sample app outline: webmail viewer
----------------------------------

How could you build an application like Squirrelmail or Hotmail in this
framework?  (This relates to the application area I had in mind in
"rumor-oriented programming".)

(This is somewhat oriented to my own style of email usage, in which all
the email I have ever received is a database of useful facts and
ideas, but I rarely read new email more often than twice a week.
Apparently for most people it's a sort of reliable work-queuing system,
which they use as a to-do list, deleting items once they're done.)

You map the history of incoming email messages into URL space: perhaps a
web page consisting of links to all the incoming messages for a
particular account for a particular month, and another "master index"
web page listing the total number of messages in each month, with links
to the individual-month pages.  These pages are not accessible to
everyone, perhaps by virtue of having unguessable URLs that never leak
outside the system.

This email history necessarily lives outside the framework I've been
discussing, since it has internal persistent state.  It might store the
messages in MySQL, with a thin layer of Perl or Python on top to render
incremental indices in the manner described above.

This allows you to display read-only views of recent messages; to
optimize this, you might want "most recent 100 messages", "most recent
1000 messages", and "most recent 10000 messages" resources that can be
cached internally, computed from the master index and some number of
individual month pages; and these summaries can include information
commonly used for search, such as From, To, Subject, and References.

If you want full-text search, you can compute inverted-index resources
per month.  (There's no point in creating per-year inverted indices,
because they would depend on the master index, and so they would become
invalid any time you received a new mail message.)  You might cache a
merged index for all months except the current one, by encoding the list
of months in the URL, and use that index (usually cached) plus the
current-month index (fast to recompute) for full-text searches.

Of course, you'd probably prefer not to have the full-text searcher
"download" the entire inverted index for every search, even from a local
cache.  But you can map arbitrarily small parts of it into URL space:
words beginning with 'i', words beginning with 'in', words beginning
with 'ind', etc.  Of course, the thing that extracts that small part
must "download" the entire inverted index, but not very often.

To provide deletion and filing in different folders, you need to store a
history of user actions, i.e. form posts.  Then you can compute
resources such as "deletions from folder INBOX", "deletions from folder
Work", "filings into folder Work", etc., from this history of user
actions, and the display of the folder Work might depend on the "filings
into folder Work" and "deletions from folder Work" resources, plus some
random collection of incoming message resources.  
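
Computing those resources from the action history is a simple fold; a
sketch, with actions represented as dictionaries:

    def deletions_from(folder, actions):
        return {a["msg"] for a in actions
                if a["type"] == "delete" and a["folder"] == folder}

    def filings_into(folder, actions):
        return {a["msg"] for a in actions
                if a["type"] == "file" and a["to"] == folder}

    def folder_view(folder, actions, incoming):
        # incoming: URLs of messages originally delivered to this folder
        return (set(incoming) | filings_into(folder, actions)) \
               - deletions_from(folder, actions)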

Suppose you want to subdivide "deletions from folder INBOX" into smaller
subsets that pertain to individual months of incoming messages, so that
you can look at lists of messages you received (but didn't delete) last
year, and not have them take a long time to display.  So you have a
"deletions from folder INBOX of messages originally received in
November 2003" resource, which depends on "deletions from folder INBOX"
as well as "messages received in November 2003".  But "deletions from
folder INBOX" changes fairly often, and whenever it does, it invalidates
"deletions from folder INBOX, November 2003".  Recalculating that
resource's representation might show that it didn't change, so there's
no need to recalculate other things that depend on it ("display of INBOX
messages received in November 2003").  But since recomputing "deletions
from INBOX, Nov 2003" involves rereading all of "deletions from INBOX",
that revalidation is probably more expensive than the work it saves.

Of course, user actions that have some effect outside this system, such
as sending email to someone outside the system, don't fit inside this
framework; you need some external "gateway" program for that.

Sample app outline: custom portal
---------------------------------

You want a web page that tells you the top news stories, the prices on
your favorite stocks, the current value of your stock portfolio,
headlines from BBC News, and your current to-do list items (pulled from
some external system like Tadalist.)

If you have the XPath part-extractor service described earlier, you can
describe this almost entirely in terms of a sequence of URLs whose
<body> contents should be concatenated.  (You might need an XSLT
transformer for the RSS feeds.)  Then you can put this sequence of URLs
on a page on your personal web site and pass its URL to the "custom
portal app" running inside this dependency-driven recomputation
environment; it reads your list of URLs and fetches a resource for each
URL, some of which (the XPath-extractor, perhaps) are inside the DDR
system and depend on external URLs.  External URLs are cached and
fetched rarely, and the compositing program doesn't have to think about
any of it.
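
A sketch of that compositing program; fetch() is again the
dependency-recording fetch, and the naive regex stands in for a real
HTML parser:

    import re

    def portal(list_url, fetch):
        bodies = []
        for url in fetch(list_url).split():
            page = fetch(url)   # cached and revalidated by the environment
            m = re.search(r"<body[^>]*>(.*)</body>", page, re.S | re.I)
            bodies.append(m.group(1) if m else page)
        return "<html><body>\n%s\n</body></html>" % "\n<hr>\n".join(bodies)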

Complementary infrastructure
----------------------------

I've mentioned a few services above that make this system more useful,
while standing outside the strict dependency-driven evaluation graph.
Here's a list of the things I've thought of:
- the Atom "feed extender" listed above, which provides a window into
  the past;
- the form-POST logger occasionally mentioned above; probably this will
  be most useful if its URL is secret or otherwise not globally
  accessible;
- URL shorteners, along the lines of makeashorterlink.com, provide four
  useful services in this context:
  - if your overall tree of dependencies is large, you may have a hard
    time fitting it into a single URL query string (containing encoded
    URLs containing their own query strings pointing at still other
    URLs).  A URL shortening service can keep the URL lengths under
    control.
  - if you can change the URLs they point to, they serve the function of
    mutable variables.
  - if they proxy for URLs rather than redirecting to them, they can
    provide revocability for access to secret capability URLs.
  - if they can take a query string with named parameters and combine
    that with any original query string, they provide currying (sketched
    after this list).
- a user interface that appreciates that URLs can take other URLs as
  parameters; see
  <http://lists.canonical.org/pipermail/kragen-tol/2005-September/000793.html>
  "Currying as a UI technique for devolution of power to users" for my
  thoughts on that, particularly the scenario entitled "An Example
  Application" and the explanatory text in the section "A REST-backed
  drag-and-drop UI".
- a web-wide invalidation protocol, like that described in the zLab note
  mentioned above.
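
The query-string combination in the last shortener feature is simple
enough to sketch; here the stored URL's own parameters win over the
caller's:

    from urllib.parse import parse_qsl, urlencode, urlsplit

    def curry(stored_url, extra_params):
        # Merge the caller's named parameters into the stored URL's
        # query string, leaving the stored parameters bound.
        parts = urlsplit(stored_url)
        merged = {**extra_params, **dict(parse_qsl(parts.query))}
        return parts._replace(query=urlencode(merged)).geturl()

    # curry("http://short.example/x?feed=http://a/b", {"n": "10"})
    # yields the same resource with both `feed` and `n` bound.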

Holes
-----

The inverted-index case points out that, in some cases, one small
program can compute a large number of resources ("words beginning with
a; words beginning with b; etc.") with no more work than it needs to
compute one of them.

As illustrated by the "INBOX deletions, Nov 2003" example above, because
there's no way to store information inside the system, there's no way to
deal with externally provided resources that are too coarse-grained and
change frequently.  You can cache smaller resources that
contain subsets of them, but revalidating those cached subsets involves
rereading the entire large resource.

The idea that a dependent resource might not actually change when the
thing it depends on changes poses a couple of problems:
- it casts doubt on the idea that a cost-based optimization algorithm
  can a priori figure out a good set of things to cache
- it interferes with eager invalidation, where the invalidations flow
  forward along the data flow edges without recomputing any of the
  intermediate values unless the invalidations reach something that's
  currently being observed.  ("eager invalidation" is the lazy
  evaluation scheme described in
  <http://lists.canonical.org/pipermail/kragen-tol/2005-May/000778.html>
  "lazy evaluation in a distributed system", and
  <http://lists.canonical.org/pipermail/kragen-hacks/2005-June/000412.html>
  "lazy evaluation in Python in a way that supports distributed
  operation", and
  <http://lists.canonical.org/pipermail/kragen-hacks/2004-June/000399.html>,
  "dynamically updating dependency networks in Smalltalk".)  I think
  it's possible to reconcile the two schemes by having three states per
  cached resource (valid, possibly invalid, invalid) instead of just
  two, but I haven't done the work to find out.  (A guess at the shape
  of this appears after the list.)
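
Since I haven't worked this out, the following Python sketch is only a
guess at the shape of the three-state scheme; recompute() and
derived_from are invented:

    VALID, MAYBE, INVALID = "valid", "maybe-invalid", "invalid"
    state, cache, derived_from = {}, {}, {}

    def taint(url):
        # Eager pass: flow doubt forward along the dataflow edges
        # without recomputing any intermediate values.
        for dep in derived_from.get(url, ()):
            if state.get(dep, VALID) == VALID:
                state[dep] = MAYBE
                taint(dep)

    def get(url, recompute):
        # Lazy pass, run only when somebody observes `url`.  (A fuller
        # version would record input validators so a MAYBE resource
        # could revalidate without recomputing.)
        if state.get(url, VALID) == VALID and url in cache:
            return cache[url]
        new = recompute(url)       # recompute() calls get() on its inputs
        if cache.get(url) != new:  # the output really did change
            cache[url] = new
            for dep in derived_from.get(url, ()):
                state[dep] = INVALID   # direct dependents must recompute
                taint(dep)             # below them, only doubt
        state[url] = VALID             # an unchanged output stops the taint
        return new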

Credits
-------

The ideas in here were inspired by work by, and were honed in
discussions with, Dan Moniz, Ben Sittler, Rohit Khare, Allan Schiffman,
Donovan Preston, and Mark Lentczner.
