Mostly a fairly random collection of notes. I'm using WWWOFFLE 2.8d now for offline web browsing. It's a proxy server that saves all the pages you ask for, and in offline mode, returns you an ugly page that tells you it remembered your request. Then, when you go online and give it the "fetch" command, it downloads all the pages you previously asked for, plus all their inline images, stylesheets, Flash files, and so on.
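The core of that behavior is simple enough to sketch. Below is a minimal, hypothetical version of the decision such a proxy makes on every request; none of this is WWWOFFLE's actual code, and the cache layout, file names, and function names are invented for illustration.

    import hashlib, os, urllib.request

    CACHE_DIR = "cache"        # hypothetical one-file-per-URL cache, not WWWOFFLE's layout
    OUTGOING = "outgoing.txt"  # hypothetical queue of remembered requests

    def cache_path(url):
        return os.path.join(CACHE_DIR, hashlib.sha1(url.encode()).hexdigest())

    def handle_request(url, online):
        """Serve from cache even if stale; otherwise fetch it or remember it."""
        os.makedirs(CACHE_DIR, exist_ok=True)
        path = cache_path(url)
        if os.path.exists(path):                    # cached: serve it, stale or not
            with open(path, "rb") as f:
                return 200, f.read()
        if online:                                  # online: fetch it and store it
            body = urllib.request.urlopen(url).read()
            with open(path, "wb") as f:
                f.write(body)
            return 200, body
        with open(OUTGOING, "a") as f:              # offline: remember it for "fetch"
            f.write(url + "\n")
        page = "<html><body>Remembered your request for %s.</body></html>" % url
        return 503, page.encode()

The "fetch" command is then, conceptually, a loop over the remembered URLs calling this with online=True, plus the recursive part: parse each fetched page for <img>, <script>, <link>, and so on, and queue those URLs too.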
The main differences between WWWOFFLE and a normal caching proxy server:

- it caches stuff longer than the HTTP spec says it should, and in offline mode it serves it up when you ask for it (in lieu of an error page) even if the HTTP spec says it's stale;
- it remembers requests it couldn't fulfill in order to fulfill them later.

It also has a fairly kickass recursive-get mode. It actually allows me to use e.g. Google Maps offline! Hooray for REST!

Some features I wish it had/bugs that annoy me:

- being responsive. Actually, I can't tell whether the long spinners when Firefox gets pages from it are Firefox's fault or WWWOFFLE's, but I suspect the latter;
- noticing DNS server changes, which is apparently impossible to do with the standard resolver library in a portable fashion, other than by restarting the server;
- remembering which pages I'd actually asked for, and which ones I was "done with", which clearly requires some better UI;
- implicitly recursively getting later pages of multi-page articles (e.g. on Wired and OSNews; clearly this has to have per-site regexes and crap);
- storing annotations on the pages so I could search by annotations (it has support for searching your offline cache of part of the web);
- a standard blocklist of ad providers, since ads made up a substantial fraction of the hazardous, expensive, battery-draining time I spent online;
- better integration into the browser UI --- I'd like to have buttons for "recursively fetch from this page" (which would take me to an options screen, rather than fetch immediately), "block this URL" (likewise, since I'd need to be able to wildcard the URL), "pages that linked to this page", and "already-stored pages linked from this page" --- which should also show up in link colors.
- while you can see the list of pending fetches (http://localhost:8080/index/outgoing/?sort=alpha;all), if you see something bad in the list (hundreds of requests for things like http://ad.doubleclick.net/adj/ttm.osnews/ros;tile=2;sz=120x600;ord=08028075), it's too hard to remove it (the remove button is hidden by default and takes you to a different page when you use it) and blacklist it. Blacklisting is an available option, but also hidden by default, and fairly painful to use; also, it doesn't actually remove the URL from the list of pending fetches. Click "Config", change Path to "Any Path" (two clicks), change Arguments to "Any or none" (two clicks), click "Change URL-SPECIFICATION", page down five times to dont-request, click it, click the "Yes" radio button, click "make change", close window --- 15 UI actions to add each wildcard, and that's once you're looking at the list of pending fetches rather than the page with the offending content. I suppose the list of pending fetches does tell you who the worst offenders are, but it doesn't show where they're linked from or what other similar pages were requested (although that is available through a few more clicks), so it's hard to distinguish ads from Google Maps images (e.g. kh1.google.com). (There's a sketch of what a one-action blocklist could look like after this list.)
- a less ugly UI --- would it be so hard to use <label> elements in the UI?
- better handling of temporary failures, such as timeouts. "Better handling" means "retry".
- also it would be nice if I could upload my configuration and list of desired URLs to a server somewhere else (like my colo), which would do the fetches for me while I was offline, and then download a big blob of compressed web pages later.
  Ideally I could put the configuration and URLs in an encrypted file that I could take to an internet cafe and upload, and download the encrypted blob of compressed web pages in the same way.
- when I used Squid for offline web browsing, I added meta http-equiv refreshes (typically refreshing about once every half hour) to all of its error pages, so that any tab displaying an error page would eventually display the web page. (A sketch of that trick follows this list.)
- URL blocking that actually worked would be nice too. I added a bunch of domains to the dont-fetch list, but WWWOFFLE kept fetching stuff from them anyway.
- hey, it would be nice if I could see why a particular page wasn't fetched after the first time I try to GET it. E.g. if it's an image, the first GET might be as an inline image --- I've seen this happen sometimes.
- there's an option to list the "pages" that were requested the last time I was offline. As I said before, it would be nice to have a better display that distinguished pages I actually loaded interactively, by clicking on a link, from inline resources like <iframe>, <script>, <img>, <embed>, and <style> links (which might be difficult without some more browser integration); it would also be nice to be able to see the titles of the pages. As it is, it's difficult to find the dozens of web pages I wanted to read in among the thousands of Google Maps images.
- while wwwoffle does nicely store things on disk in separate files, the content of the HTTP response isn't stored in a separate file --- so you can't use gthumb, gv, file, gzip, and so on, on the files from wwwoffle's cache dirs --- you have to access them through wwwoffle, which means through HTTP. That's a little inconvenient.
- At first you might wonder why "sort by file type" doesn't display the content-type it's sorting by. Then you realize that it's actually sorting by the file extension, and is therefore worse than useless: e.g. http://www.folklore.org/StoryView.py?project=Macintosh&story=90_Hours_A_Week_And_Loving_It.txt (which is HTML!) ends up next to http://gnosis.cx/download/gnosis/util/convert/curses_txt2html.py
- It's fairly opaque about how it chooses to purge things from its cache. There's a "purge" section in the config file, but from my reading of it, it shouldn't delete anything until I haven't read it for four weeks. But in fact it's already deleted a bunch of stuff.
- And it would be nice if it remembered more than one version of each page.
- strace says it forks a new thread (using clone) for every child request. Perhaps this explains why it is so slow.
- And sometimes it tries to send me an error page with "Transfer-Encoding: chunked,chunked" and two layers of chunking, which Firefox doesn't like --- the result is that the error page displays with some three-digit hexadecimal numbers at the top. (I think this should have been fixed in WWWOFFLE 2.8b. There's a dechunking sketch after this list.)
- Every once in a while, especially under heavy load, it serves up the wrong representation --- I got PNGs and GIF89as for a bunch of my HTML pages today.
- You can push it on or offline with an HTTP GET.
- I caught it doing continuous DNS lookups on www.meetomatic.com, which had an A record, for many minutes --- several requests per second for perhaps half an hour. This wouldn't be so bad except that it did lots of AAAA requests, which failed and thus weren't cached.
- Its connect timeouts seem to be painfully short, on the order of tens of seconds.

As far as I can tell from reading the changelog (NEWS file), none of the things listed above have been fixed in more recent WWWOFFLEs.
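On the blocklist and dont-request complaints in the list above: what I want from a "block this URL" button is a single action that appends a wildcard pattern somewhere, and a pending-fetch queue that respects it. Here is a sketch of that idea; the pattern syntax and the list contents are invented for illustration and are not WWWOFFLE's dont-request configuration.

    import fnmatch

    # Hypothetical one-pattern-per-line blocklist; this is what a one-click
    # "block this URL" button could append to.
    blocklist = [
        "http://ad.doubleclick.net/*",
        "*.googlesyndication.com/*",
    ]

    def blocked(url):
        return any(fnmatch.fnmatch(url, pattern) for pattern in blocklist)

    # A blocked URL should disappear from the pending-fetch queue too:
    pending = [
        "http://ad.doubleclick.net/adj/ttm.osnews/ros;tile=2;sz=120x600;ord=08028075",
        "http://www.example.org/some-article",   # hypothetical page I actually wanted
    ]
    pending = [url for url in pending if not blocked(url)]
    print(pending)   # only the article survives

The point is that blocking should be one gesture from wherever you noticed the offender, and it should prune the outgoing queue at the same time.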
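The Squid trick mentioned in the list is just a tag injected into the error page markup so that the browser retries on its own. Roughly this (the half-hour interval is the one I used; the function is illustrative, not anything Squid provides):

    REFRESH_SECONDS = 1800   # about half an hour

    def add_refresh(error_page_html):
        """Make a tab showing this error page reload itself periodically."""
        tag = '<meta http-equiv="refresh" content="%d">' % REFRESH_SECONDS
        return error_page_html.replace("<head>", "<head>" + tag, 1)

    print(add_refresh("<html><head><title>Error</title></head><body>try later</body></html>"))

Once the page is actually in the cache, one of those automatic reloads picks it up and the tab quietly turns into the page you wanted.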
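The "Transfer-Encoding: chunked,chunked" item is easier to picture with the chunked format in front of you: each chunk is a hexadecimal length line followed by that many bytes of data, and a body that has been chunked twice has to be decoded twice. A client that strips only one layer is presumably left rendering the inner layer's length lines, which would explain the three-digit hexadecimal numbers at the top of the page. A minimal decoder, ignoring trailers and chunk extensions:

    def dechunk(body):
        """Undo one layer of HTTP chunked transfer encoding."""
        out, pos = bytearray(), 0
        while True:
            eol = body.index(b"\r\n", pos)
            size = int(body[pos:eol].split(b";")[0], 16)   # hexadecimal length line
            if size == 0:                                  # last-chunk marker
                return bytes(out)
            out += body[eol + 2:eol + 2 + size]            # the chunk's data
            pos = eol + 2 + size + 2                       # skip the data and its CRLF

    # b"hello world" chunked once, then chunked again:
    doubly_chunked = b"1a\r\n5\r\nhello\r\n6\r\n world\r\n0\r\n\r\n\r\n0\r\n\r\n"
    print(dechunk(dechunk(doubly_chunked)))                # b'hello world'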
Unfortunately I think a good system for this kind of thing implicitly encodes some workflow. At a minimum, each URL is in one of these states:

- nobody cares
- requested
- currently being fetched
- waiting to be read
- discarded
- archived

The idea is that from "waiting to be read", a page can go into "discarded" or "archived". I also think you need some kind of categorization system for this --- anyway, I do --- such that some of these states can pertain to a particular category. Normally pages should inherit their categories from the page they were linked from, and you should be able to see and edit the categories for a page in the browser when you have it open.
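Here's a sketch of that, with the states and the category-inheritance rule written out as data; the names are made up, and a real implementation would live in the proxy and the browser rather than in a toy script.

    from dataclasses import dataclass, field

    # from "waiting to be read" a page goes to "discarded" or "archived";
    # everything before that just moves forward
    TRANSITIONS = {
        "nobody cares":            {"requested"},
        "requested":               {"currently being fetched"},
        "currently being fetched": {"waiting to be read"},
        "waiting to be read":      {"discarded", "archived"},
        "discarded":               set(),
        "archived":                set(),
    }

    @dataclass
    class Page:
        url: str
        state: str = "nobody cares"
        categories: set = field(default_factory=set)

        def move_to(self, new_state):
            # refuse transitions the workflow doesn't allow
            assert new_state in TRANSITIONS[self.state], \
                "can't go from %r to %r" % (self.state, new_state)
            self.state = new_state

    def request(url, linked_from=None):
        """New pages inherit their categories from the page they were linked from."""
        categories = set(linked_from.categories) if linked_from else set()
        return Page(url, "requested", categories)

    # hypothetical usage: an article I asked for, and an image it pulled in
    article = request("http://www.example.org/some-article")
    article.categories.add("to-read")
    image = request("http://www.example.org/map-tile.png", linked_from=article)
    article.move_to("currently being fetched")

The per-category part would amount to keying some of these states by category instead of globally, which is exactly the sort of thing the browser UI would need to expose.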

