At 07:13 PM 7/9/2007 +0400, René Dudfield wrote:
>The way to do this atomically, so no one can possibly get an old
>page: the static file will be removed as the change is committed.
>Then everyone gets the latest change right away - as soon as the
>change has been committed.
This sounds pretty good... except that you may need better protection against a race condition. What happens if a page is removed *while* it is being regenerated? PostgreSQL has MVCC for read-only transactions, so the static page may be generated against old data, unless some other locking mechanism, shared by both the deletion and generation code, serializes access to the static file.

One possible approach: the generator writes its output to 'foo/index.html.tmp' (opened with exclusive access) and then renames it to 'foo/index.html'; the deletion mechanism then attempts to remove the .tmp file *first*, then the real file. Both processes must be robust against their renames, unlinks, or exclusive open()s failing, but there would then be no possibility of collision.

The exclusive open would have to be done at the *start* of write processing, however, before any database queries have been attempted. (And the database connection must be rolled back at that point.) This ensures that if a writer succeeds in locking the .tmp file, it is seeing data that is current.

All that having been said, the idea in general sounds good. If PyPI itself simply checked whether the URL it's about to serve is cacheable (i.e., it has a static location and no user is logged in), and if so opened the temp file for exclusive writing, it could just dump its generated page out, renaming it at the end if it succeeded in acquiring the temp file. And voila! No separate caching process, no scheduling, and an always perfectly up-to-date cache. As soon as a page becomes out of date, it gets served dynamically... but only for as long as it takes to serve one copy of that page.
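As a concrete illustration, here is a minimal, self-contained sketch of the exclusive-open-then-rename dance described above. The names (write_cache_atomically, render_page) are hypothetical stand-ins, not PyPI's actual code; render_page plays the role of "process the request normally against current data":

    import os

    def write_cache_atomically(cache_path, render_page):
        """Commit a cached page so readers never see a partial file.

        Returns True if the page was generated and committed to the
        cache, False if another writer already holds the .tmp file
        (or an invalidator removed it before we could rename).
        """
        tmp_path = cache_path + '.tmp'
        try:
            # O_EXCL makes creation fail if the .tmp file already
            # exists, so at most one writer works on a page at a time.
            fd = os.open(tmp_path,
                         os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
        except OSError:
            return False  # someone else is (re)generating this page
        try:
            with os.fdopen(fd, 'w') as tmp:
                # Query the database only *after* acquiring the lock,
                # so the data written is guaranteed to be current.
                tmp.write(render_page())
            # rename() is atomic on POSIX: readers see either the old
            # page or the new one, never a half-written file.
            os.rename(tmp_path, cache_path)
            return True
        except OSError:
            # An invalidator unlinked our .tmp file out from under us;
            # the rename fails and the page is served dynamically.
            return False

Note how a False return in the second except clause is exactly the "first unlink prevents the commit" case from the serialization argument below: unlinking the .tmp file makes the writer's rename() fail, so a stale page can never land in the cache.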
:)

In pseudocode:

    def process_request():
        if no authentication header and URL path is cacheable:
            try:
                temp = exclusive open cache file with .tmp extension
            except os.error:
                pass
            else:
                with stdout redirected to temp:
                    process_request_normally()
                try:
                    rename(tempfilename, realfilename)
                except os.error:
                    pass
                send_browser_contents_of(temp)
                return
        return process_request_normally()

Here, 'process_request_normally()' should refer to everything that PyPI does now, *including database connection rollback or commit*. This will ensure that it's impossible to write stale data to the cache.

The deletion process should just do this:

    for name in (cache_path + '.tmp', cache_path):
        try:
            os.unlink(name)
        except os.error:
            pass

after committing the database transaction.

Informal serialization proof:

* Only one process may write to a page's .tmp file at a time.

* Either the writer has committed its page write (by renaming the .tmp file), or it has not (i.e., rename() is atomic).

* If the writer has *not* committed its page, then the first unlink will prevent it from doing so.

* If the writer *has* committed its page, then the second unlink will undo this.

* If, between the two unlink operations, another writer appears, that writer will be reading current data from the database, because it must acquire exclusive access to the .tmp file before doing a rollback and reading the data it will use for writing.

QED: it is impossible to have stale data in the cache, unless the invalidating request fails to attempt its two unlink operations during the brief window after its database commit.

_______________________________________________
Catalog-SIG mailing list
Catalog-SIG@python.org
http://mail.python.org/mailman/listinfo/catalog-sig