(3800 words, terribly sorry)

Lots of web pages are computed dynamically from some set of resources. For example, a blog's front page is computed dynamically from a (potentially large) set of blog posts. The display of an RSS aggregator, such as the LiveJournal friends page, is computed dynamically from a subscription list and a bunch of friends' blogs. Many web pages use some kind of templating system, so the page consists at least of "content" and the template.
Performance problems in web page compositing
--------------------------------------------

These applications share a set of performance problems. Retrieving the resources they depend on can take time, use bandwidth, and cost other people money; dynamically generating the web pages themselves uses CPU time; and all of this often happens while a human being is impatiently waiting for a response.

Dependency-directed software builds: make
-----------------------------------------

Normally people write software as a human-readable "source code file" which gets "compiled" (by a slow program) into an "object file" containing "machine code", which then gets "linked" with other object files (by a fast program) into an "executable" which they can run. When they make changes to the source code, they often want to see the results of their changes, and they have to rerun this whole process. But typically they haven't changed all the source files, so they only have to recompile some of them.

There's a widely used program called "make" that figures out which object files are out of date and need to be recompiled from source. We write a "Makefile" which gives the command to build each object file, and the command to link the final executable, along with a list of the source files that go into that object file (or object files that go into that executable). If any of these "dependencies" are newer than the "target", then "make" reruns the command, which "rebuilds" the "target" from the data in the "dependencies". Specifying these dependencies is error-prone, and if something fails to get rebuilt when it should, it often results in hard-to-debug problems.

Interposition in make, with Vesta
---------------------------------

Vesta (http://www.vestasys.org/) is a free-software system for, among other things, replacing make.
It uses "filesystem encapsulation" to make sure it has its dependencies correct; its approach to tracking dependencies (http://www.vestasys.org/doc/sdl-ref/walkthrough.html#RunTool) is to run the compiler (and linker and whatnot) in a chroot in which the only filesystem available is one provided by Vesta itself. This means that it can detect every file the compiler reads, and they're all checked into its version-tracking repository, so it can reliably tell whether rerunning a build step would produce the same results as before --- in which case, it can safely use the previous results.

(Vesta's replacement for make isn't compatible with Makefiles, using its own syntax and even its own terminology, and it only works if you store your files in Vesta's version-tracking repository. I suspect these may be reasons Vesta isn't more widely used. Perhaps someone should create a version of make that works the same way.)

Interposition in web-page compositing
-------------------------------------

It seems to me that you could use the Vesta strategy with a web-page compositing system, integrated into a caching proxy HTTP server. If all the inputs to your web-page-generating script are fetched by HTTP through this proxy, then the proxy can know what they are, and serve cached output if they haven't changed. It can measure the CPU cost of running each script, and only cache output when the time saved is worth the space used. It could even run some scripts periodically (if they take a while to run, and their results are likely to be used) in order to ensure that their results are available promptly when needed.

You'd have to make sure you didn't forget any inputs, though --- if you used any inputs that weren't handled automatically (like the filesystem, perhaps, or the system clock), you'd want a way to handle those out-of-band.

Common components
-----------------

A caching proxy caches HTTP results at the level of entire resources (well, their representations).
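Before getting to the components, here is a minimal sketch of the interposition idea from the previous section: a wrapper that records every URL a script fetches, and replays the cached output when none of those inputs has changed. This is only an in-memory illustration; `RecordingFetcher` and all the other names are hypothetical, and a real proxy would use HTTP validators (ETag, Last-Modified) rather than refetching inputs to check them.

```python
import hashlib

class RecordingFetcher:
    """Wraps an underlying fetch function, recording every URL a script
    reads so we can later decide whether its cached output is stale."""

    def __init__(self, fetch):
        self._fetch = fetch  # underlying fetch: url -> body (str or bytes)
        self.cache = {}      # script name -> (input digests, cached output)

    def run(self, name, script):
        """Run `script` (a callable taking a fetch function); reuse the
        cached output if every recorded input is byte-for-byte unchanged."""
        cached = self.cache.get(name)
        if cached is not None:
            deps, output = cached
            # A real proxy would revalidate with ETag/If-Modified-Since
            # instead of refetching and hashing each input.
            if all(self._digest(self._fetch(url)) == d
                   for url, d in deps.items()):
                return output            # every input unchanged: serve cache
        deps = {}
        def tracking_fetch(url):         # records each input as it is read
            body = self._fetch(url)
            deps[url] = self._digest(body)
            return body
        output = script(tracking_fetch)
        self.cache[name] = (deps, output)
        return output

    @staticmethod
    def _digest(body):
        if isinstance(body, str):
            body = body.encode('utf-8')
        return hashlib.sha256(body).hexdigest()
```

The point is that the script never declares its dependencies; like Vesta's chroot trick, the execution environment observes them.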
If you write a script that depends on eBay's "Toshiba Laptops, 500-666MHz" page (http://computers.listings.ebay.com/Toshiba_500-666MHz_W0QQfromZR4QQsacatZ31561QQsocmdZListingItemList), in order to inform you when a new Tecra 8100 is going on sale, you might not want to rerun your script every time the banner ad on that page changes. Instead, you could split it into two parts: one that executes some query on that page (maybe '<tr> containing "8100" and not containing <tr>') and makes the results available at some other URL, and a heavier-weight script that uses the stripped-down page at this other URL. Then your heavier-weight script would only run when the part of the page it cared about had changed. This lightweight part-extractor component will be useful in many applications, since often you only care about a part of the pages you depend on, and it can make this fact visible to the execution environment.

An analogous component is one that combines several resources into one, perhaps so that you can extract parts from the combination more easily, or perhaps so that you can display the combination itself as your "portal" page. By separating out this "compositing" or "templating" function as a separate task, you allow the various pieces to be recomputed (and possibly cached) independently. A third common component, which might subsume the tasks of the first two, is an XSLT transformer.

The Atom format, which is normally used for sequences of updates, suggests several potentially interesting components that can live within this framework: a "truncator" that drops all but the most recent N events from a given feed; an "aggregator" that takes several Atom feeds and interleaves their posts according to time; and an "encapsulator" or "watcher" that takes an entire web page and encapsulates it as a single Atom item in a one-item-long feed.
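The three Atom-ish components above are simple enough to sketch. Here feeds are modeled as plain lists of (timestamp, payload) pairs, newest first, rather than real Atom XML; the function names are my own, not anything from the Atom spec.

```python
# A "feed" is a list of (timestamp, payload) entries, newest first.

def truncate(feed, n):
    """Truncator: keep only the most recent n entries."""
    return feed[:n]

def aggregate(*feeds):
    """Aggregator: interleave several feeds' entries by timestamp,
    newest first, like a LiveJournal friends page."""
    merged = [entry for feed in feeds for entry in feed]
    merged.sort(key=lambda entry: entry[0], reverse=True)
    return merged

def watch(old_entry, page_body, timestamp):
    """Watcher: encapsulate an entire page as a one-item feed, minting a
    new entry only when the page actually changed."""
    if old_entry is not None and old_entry[1] == page_body:
        return [old_entry]       # unchanged: keep the old timestamp
    return [(timestamp, page_body)]
```

Because each of these is a pure function of its input feeds, each fits naturally into the dependency-driven caching scheme: its output only needs recomputing when an input feed changes.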
Other features of this execution environment
--------------------------------------------

In addition to network access, input caching, output caching, CPU measurement, and speculative pre-caching (this last I haven't discussed), the following features would fit in nicely:

- special handling of Cache-control: max-age (and Age:, of course). Scripts inside the system could use this to prevent the proxy from polling an origin server too often, and normally it would propagate from a request for a composited page to the requests used to fetch the components of that page.

- invalidation propagation. Speculative pre-caching, if it produces a new representation for an existing resource, may invalidate other cached representations that were computed from that one. (HTTP might not consider those representations "stale", but they no longer reflect the current state of the resources they depend on.) The execution environment should propagate the cache invalidation. In some cases, it may be wise to speculatively recompute those other resources as well. (All of this might also occur while a request is being processed, but it would be wise to defer any speculative recomputation until the response is finished.)

- special HTTP error handling. In many cases, it's better to get an out-of-date cached representation of a resource than an error message, and any caching proxy would be in a position to provide this benefit (although I don't know of any that do). In other cases, you might want to propagate e.g. 500 and 404 errors through automatically, rather than having special-case code in your service that handles this. (Perhaps that's better to put in a library than in an execution framework, though.)

The choice of which representations to cache cannot be solved locally, since the benefit of caching something depends on the cost of computing it.
Something that makes sense to cache because it depends on some uncached and expensive computation might cease to make sense to cache once that uncached computation gets cached itself. It turns out that this is related to well-known problems in OLAP; see J. Ullman, "Efficient Implementation of Data Cubes via Materialized Views", http://dbpubs.stanford.edu/pub/1996-32.

Given limited resources, the optimal frequency at which to speculatively recompute something (whether it's as simple as a cached copy of a web page, or something much more CPU-intensive) does not increase monotonically as the update rate of its dependencies increases. At some point, it is no longer worth re-fetching web pages more often in the hope that they will be up to date when they are needed. See Arasu, Cho, Garcia-Molina, Paepcke, and Raghavan, "Searching the Web", http://dbpubs.stanford.edu/pub/2000-37.

Other systems complementary to this dependency-driven execution environment
---------------------------------------------------------------------------

Without a protocol for pushing cache invalidations across the network, the above-described cache-invalidation scheme only works inside a single server. Given a cache-invalidation protocol for HTTP resources (for example, the one suggested in my zLab note, "Synchronizing REST Resources With Asynchronous Notification: A Practical Proposal", which presently resides at http://wiki.commerce.net/wiki/Synchronizing_REST_Resources_With_Asynchronous_Notification:_A_Practical_Proposal), it is possible to prevent any composited page from being out of date except when the network is unreliable or under conditions of excessive load.

Such a protocol could be added to existing web sites with a program we at CommerceNet have been calling a "delta proxy": a reverse proxy that sits in front of your web server and notices when resources' representations change.
It is then in a position to provide various kinds of change notification, such as a Google "sitemap" file, RSS and Atom feeds, or pushed cache invalidations. With Ajax, you can actually push the update notification all the way to the web browser, updating information on somebody's screen in real time. (The current suite of Ajax tools doesn't support asynchronous event delivery to the browser --- the tools don't hold any connection open to the server --- but as we showed at KnowNow, it is possible to do with all modern browsers, with some difficulty.)

Many applications you might want to build would require some historical information --- for example, if you have a web page that shows the current temperature in Portland, Oregon, you might want to graph it over time. Given the previously mentioned "watcher" component that encapsulates a web page as an Atom feed, there's one additional component, which doesn't quite fit into this framework, that can enable such applications. It's a "feed extender", which turns a shorter Atom feed (such as a one-item feed) into a longer feed, with up to N items, by remembering items it saw in the past and inserting "new" items at the beginning of the feed.

There's also the question of how to avoid scaling problems in publishing resources like this. I think we can build a decentralized network that gives you a few orders of magnitude of improvement here; I've written about this previously at http://wiki.commerce.net/wiki/Point_to_Point_Multicast_Atom_Network. If you supplement the system described there with cryptographic signatures bearing serial numbers, with key hashes included in the URL, you can avoid the necessity for anyone to hit the origin server, so there need not be a publicly accessible origin server any more.
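The "feed extender" is the one stateful piece here, and its core is small. A sketch (feed items modeled as (id, payload) pairs, newest first; the function name is hypothetical):

```python
def extend_feed(history, short_feed, max_items):
    """Feed extender: merge the items of a short (possibly one-item) feed
    into a remembered history, newest first, keeping up to max_items.
    Returns the new history, which is also the extended feed."""
    seen = {item_id for item_id, _ in history}
    new_items = [item for item in short_feed if item[0] not in seen]
    return (new_items + history)[:max_items]
```

The remembered `history` is exactly the internal persistent state that keeps this component outside the pure dependency-driven framework.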
Internal State as Cached Computations on I/O History
----------------------------------------------------

Several formalisms describe a program as a function from an input history to an output action; my own "rumor-oriented programming" idea <http://lists.canonical.org/pipermail/kragen-tol/2004-January/000749.html> depends on this. If you map your input history into URL space as REST resources, then your program can be entirely stateless --- all its state can live out in the web, with this caching proxy to save it from taking a long time to do anything at all.

Of course, you don't want *any* change to the input history to invalidate *all* your previous computations. So you structure your program in parts. Suppose, for concreteness, you have a blogging application whose input history consists of HTTP form POSTs to some URL. You have one resource that lists all the past form POSTs as links, in order, and one resource for each form POST. Your blog-front-page program looks at this form-POST index and issues a request for each group of ten form POSTs, containing their URLs, to a summarization script, which returns the dates of the blog posts contained in those form POSTs. Then the front-page program decides which ten or fifteen form POSTs are relevant to displaying the front page, and fetches them directly. Although the front-page program's cached results are invalidated whenever a new POST is made, the summarization script's results are not, and most of the computation (trawling through massive amounts of text and deciding which posts are authorized and which aren't, for instance) is in the summarization script.

In general, you can turn a computation on your entire input history into more-cacheable computations on subsets of it in this way. It may be difficult to fit the specifications for these computations into a URL, though.

Some apps depend on the past representations of some other URL-addressable resource instead of explicit user actions.
These apps can be wrapped up into this framework with the Atom "watcher" and "feed extender" components discussed above.

Sample app outline: eBay querier
--------------------------------

Suppose you want to follow the current market price of Toshiba Tecra 8100s on eBay over time. eBay used to allow easy searching of recently-completed auctions, which gave you an easy way to get the current market price: look for successfully-completed auctions of recent similar items and use the final selling price. "Buy it now!" added a confounding factor, since it allowed a bad seller to depress the apparent market price, but you could ignore the "Buy it now!" hits. But first eBay moved the "search completed items" option off the main search page onto search-completed.ebay.com. Now even that requires that you log into your eBay account, so they can track who's searching for what. My approach now is to do a current search, find the non-buy-it-now auctions ending soonest, bookmark the links to those auctions, and follow those links after the auctions are scheduled to end. This is a little laborious, and it would be nice to automate.

So you establish a "watcher" and a "feed extender" with a polling interval of, say, an hour, on some relevant eBay category page. Then you write a script which looks at the "extended feed" and, given a search string, finds items in a recent version of the page matching the string that are not there in the current version of the page. Then it fetches the items' individual auction pages and extracts the number of bids and the closing price, and returns a table of these with the item descriptions and links.

Sample app outline: webmail viewer
----------------------------------

How could you build an application like Squirrelmail or Hotmail in this framework? (This relates to the application area I had in mind in "rumor-oriented programming".)
(This is somewhat oriented to my own style of email usage, in which all the email I have ever received is a database of useful facts and ideas, but I rarely read new email more often than twice a week. Apparently for most people it's a sort of reliable work-queuing system, which they use as a sort of to-do list, deleting items once they're done.)

You map the history of incoming email messages into URL space: perhaps a web page consisting of links to all the incoming messages for a particular account for a particular month, and another "master index" web page listing the total number of messages in each month, with links to the individual-month pages. These pages are not accessible to everyone, perhaps by virtue of having unguessable URLs that never leak outside the system. This email history necessarily lives outside the framework I've been discussing, since it has internal persistent state. It might store the messages in MySQL, with a thin layer of Perl or Python on top to render incremental indices in the manner described above.

This allows you to display read-only views of recent messages; to optimize this, you might want "most recent 100 messages", "most recent 1000 messages", and "most recent 10000 messages" resources that can be cached internally, computed from the master index and some number of individual-month pages; and these summaries can include information commonly used for search, such as From, To, Subject, and References.

If you want full-text search, you can compute inverted-index resources per month. (There's no point in creating per-year inverted indices, because they would depend on the master index, and so they would become invalid any time you received a new mail message.) You might cache a merged index for all months except the current one, by encoding the list of months in the URL, and use that index (usually cached) plus the current-month index (fast to recompute) for full-text searches.
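The per-month-index scheme can be sketched in a few lines. Everything here is hypothetical scaffolding: indices are plain dicts from word to set of message ids, and a real system would serve each index as a URL-addressable resource.

```python
from collections import defaultdict

def month_index(messages):
    """Build an inverted index for one month: word -> set of message ids.
    `messages` maps message id -> body text."""
    index = defaultdict(set)
    for msg_id, body in messages.items():
        for word in body.lower().split():
            index[word].add(msg_id)
    return index

def merge_indices(*indices):
    """Merge several months' indices. In the scheme above, this merged
    index covers every month but the current one, so it is almost
    always servable from cache."""
    merged = defaultdict(set)
    for index in indices:
        for word, ids in index.items():
            merged[word] |= ids
    return merged

def search(word, merged_index, current_index):
    """Full-text search = cached merged index + cheap current-month index."""
    return merged_index.get(word, set()) | current_index.get(word, set())
```

A new message only invalidates the current month's index and the merged index is untouched, which is the whole point of splitting by month.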
Of course, you'd probably prefer not to have the full-text searcher "download" the entire inverted index for every search, even from a local cache. But you can map arbitrarily small parts of it into URL space: words beginning with 'i', words beginning with 'in', words beginning with 'ind', etc. Of course, the thing that extracts each small part must "download" the entire inverted index, but not very often.

To provide deletion and filing in different folders, you need to store a history of user actions, i.e. form POSTs. Then you can compute resources such as "deletions from folder INBOX", "deletions from folder Work", "filings into folder Work", etc., from this history of user actions, and the display of the folder Work might depend on the "filings into folder Work" and "deletions from folder Work" resources, plus some random collection of incoming-message resources.

Suppose you want to subdivide "deletions from folder INBOX" into smaller subsets that pertain to individual months of incoming messages, so that you can look at lists of messages you received (but didn't delete) last year, and not have them take a long time to display. So you have a "deletions from folder INBOX of messages originally received in November 2003" resource, which depends on "deletions from folder INBOX" as well as "messages received in November 2003". But "deletions from folder INBOX" changes fairly often, and whenever it does, it invalidates "deletions from folder INBOX, November 2003". Recalculating that resource's representation might show that it didn't change, so there's no need to recalculate other things that depend on it ("display of INBOX messages received in November 2003"). But since recomputing "deletions from INBOX, Nov 2003" involves reading all of "deletions from INBOX", that recomputation is probably expensive anyway.
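That stop-when-unchanged behavior can be sketched as a small in-memory dependency graph. `DependencyGraph`, `touch`, and the resource names are all hypothetical; the interesting part is that invalidation only propagates past a resource whose recomputed representation actually differs from the cached one.

```python
class DependencyGraph:
    """Tracks which cached resources were computed from which others, and
    stops invalidation propagating past a resource whose recomputed
    representation turns out to be unchanged."""

    def __init__(self):
        self.compute = {}   # url -> function producing the representation
        self.deps = {}      # url -> list of urls it was computed from
        self.cache = {}     # url -> current cached representation

    def define(self, url, fn, deps):
        self.compute[url] = fn
        self.deps[url] = deps
        self.cache[url] = fn()

    def touch(self, changed_url):
        """A resource changed: recompute its dependents, but only descend
        further when a dependent's representation actually changed."""
        recomputed = []
        for url, deps in self.deps.items():
            if changed_url in deps:
                new = self.compute[url]()
                recomputed.append(url)
                if new != self.cache[url]:       # representation changed:
                    self.cache[url] = new
                    recomputed += self.touch(url)  # keep propagating
        return recomputed
```

In the INBOX example, appending a December deletion to the log forces "deletions from INBOX, Nov 2003" to be recomputed, but since its bytes come out identical, the November display is left alone.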
Of course, user actions that have some effect outside this system, such as sending email to someone outside it, don't fit inside this framework; you need some external "gateway" program for that.

Sample app outline: custom portal
---------------------------------

You want a web page that tells you the top news stories, the prices of your favorite stocks, the current value of your stock portfolio, headlines from BBC News, and your current to-do list items (pulled from some external system like Tadalist). If you have the XPath-extractor service described earlier, you can describe this almost entirely in terms of a sequence of URLs whose <body> contents should be concatenated. (You might need an XSLT transformer for the RSS feeds.) Then you can put this sequence of URLs on a page on your personal web site and pass its URL to the "custom portal app" running inside this dependency-driven recomputation environment; it reads your list of URLs and fetches a resource for each URL, some of which (the XPath-extractor, perhaps) are inside the DDR system and depend on external URLs. External URLs are cached and fetched rarely, and the compositing program doesn't have to think about any of it.

Complementary infrastructure
----------------------------

I've mentioned a few services above that make this system more useful while standing outside the strict dependency-driven evaluation graph. Here's a list of the things I've thought of:

- the Atom "feed extender" listed above, which provides a window into the past;

- the form-POST logger occasionally mentioned above; probably this will be most useful if its URL is secret or otherwise not globally accessible;

- URL shorteners, along the lines of makeashorterlink.com, which provide four useful services in this context:

  - if your overall tree of dependencies is large, you may have a hard time fitting it into a single URL query string (containing encoded URLs containing their own query strings pointing at still other URLs).
  A URL shortening service can keep the URL lengths under control;

  - if you can change the URLs they point to, they serve the function of mutable variables;

  - if they proxy for URLs rather than redirecting to them, they can provide revocability for access to secret capability URLs;

  - if they can take a query string with named parameters and combine it with any original query string, they provide currying;

- a user interface that appreciates that URLs can take other URLs as parameters; see <http://lists.canonical.org/pipermail/kragen-tol/2005-September/000793.html>, "Currying as a UI technique for devolution of power to users", for my thoughts on that, particularly the scenario entitled "An Example Application" and the explanatory text in the section "A REST-backed drag-and-drop UI";

- a web-wide invalidation protocol, like the one described in the zLab note mentioned above.

Holes
-----

The inverted-index case points out that, in some cases, one small program can compute a large number of resources ("words beginning with a", "words beginning with b", etc.) with no more work than it needs to compute one of them. As illustrated by the "INBOX deletions, Nov 2003" example above, because there's no way to store information inside the system, there's no way to deal with externally provided resources that are too large in granularity and change frequently. You can cache smaller resources that contain subsets of them, but revalidating those cached subsets involves rereading the entire large resource.

The idea that a dependent resource might not actually change when the thing it depends on changes poses a couple of problems:

- it casts doubt on the idea that a cost-based optimization algorithm can figure out, a priori, a good set of things to cache;

- it interferes with eager invalidation, where invalidations flow forward along the data-flow edges without recomputing any of the intermediate values unless the invalidations reach something that's currently being observed.
("Eager invalidation" is the lazy evaluation scheme described in <http://lists.canonical.org/pipermail/kragen-tol/2005-May/000778.html>, "lazy evaluation in a distributed system"; <http://lists.canonical.org/pipermail/kragen-hacks/2005-June/000412.html>, "lazy evaluation in Python in a way that supports distributed operation"; and <http://lists.canonical.org/pipermail/kragen-hacks/2004-June/000399.html>, "dynamically updating dependency networks in Smalltalk".)

I think it's possible to reconcile the two schemes by having three states per cached resource (valid, possibly invalid, invalid) instead of just two, but I haven't done the work to find out.

Credits
-------

The ideas in here were inspired by work by, and were honed in discussions with, Dan Moniz, Ben Sittler, Rohit Khare, Allan Schiffman, Donovan Preston, and Mark Lentczner.

