I've been preparing a static pull of wiki.laptop.org to send to bandwidth-challenged regions, as well as to use as a failover in case of high load.
It's basically a simple 'wget -EkKm http://wiki.laptop.org' pull of the site (flags spelled out below). Interesting fact: the root directory contains 1,061,633 separate files, and an 'ls' of that directory takes 9m24s. This is an ext3-formatted partition. Repeating the ls takes only 10s; Linux's dcache is a marvel. Apache seems to perform reasonably well serving files from such huge directories. Should I be concerned?

Can anyone suggest:

a) a patched wget, or a tool other than wget, that would fabricate an appropriate directory structure to keep everything from being thrown together in the root or /go/ directories?

b) whether reformatting with reiserfs or some other filesystem is worth the trouble? ext3 already has btree-structured directories, so reiserfs isn't quite the obvious win it used to be.

c) a patched wget or other tool that will actually honor robot-exclusion directives in <meta> tags in page headers? wget seems to honor 'nofollow', but mediawiki puts <meta name="robots" content="noindex,nofollow" /> in the <head> of edit and printable pages, and that isn't enough to convince wget to delete the file it has just downloaded. We really don't need those pages in the static pull; they just bloat our directories. I could rig a find script after the fact (sketched below), but I'd prefer not to go through the stage of having a bazillion files in the directory before it's cleaned up.

I also tweaked the language settings on the wiki slightly, which should reduce the number of files by a factor of 5 or so by suppressing the &setlang=<LANG> links in the sidebar; the existing "In other languages" links are preferable for this purpose.

But maybe the combined wisdom of devel@ can suggest other things I could be trying.
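For anyone reproducing the pull, the flags expand to the following (per GNU wget's manual; older wgets spell -E as --html-extension):

    #  -E  --adjust-extension   save pages with an .html suffix
    #  -k  --convert-links      rewrite links for local browsing
    #  -K  --backup-converted   keep a .orig copy of each rewritten page
    #  -m  --mirror             recursion + timestamping, infinite depth
    wget -E -k -K -m http://wiki.laptop.org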
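On (a): failing a smarter wget, a crude post-hoc shuffle could bucket the flat root into 256 subdirectories keyed on a hash of each file name. This is only a sketch (the two-hex-digit bucketing is arbitrary), and every link in the pages would still need rewriting to match, which is why doing it at fetch time would be much nicer:

    # shard the current directory's files into 00/ .. ff/
    for f in *; do
        [ -f "$f" ] || continue
        d=$(printf '%s' "$f" | md5sum | cut -c1-2)
        mkdir -p "$d" && mv -- "$f" "$d/"
    done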
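On (b): before reformatting, it's worth confirming that ext3's btree (htree) indexing is actually enabled; dir_index is a feature flag that older ext3 volumes may lack. /dev/sdXN below is a placeholder for the real device:

    tune2fs -l /dev/sdXN | grep -i 'features'
    # if dir_index is missing, turn it on and reindex existing
    # directories with a forced fsck (filesystem unmounted):
    #   tune2fs -O dir_index /dev/sdXN
    #   e2fsck -D /dev/sdXN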
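On (c): the after-the-fact cleanup I'd rather avoid would be something along these lines, with grep doing the finding (GNU grep/xargs assumed; the -Z/-0 pairing keeps odd filenames safe):

    # delete every saved page whose head carries mediawiki's robots tag
    grep -rlZ --include='*.html' \
        'meta name="robots" content="noindex,nofollow"' . \
      | xargs -0 rm -f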
--scott

--
 ( http://cscott.net/ )