Mostly a fairly random collection of notes.

I'm using WWWOFFLE 2.8d now for offline web browsing.  It's a proxy
server that saves all the pages you ask for, and in offline mode,
returns you an ugly page that tells you it remembered your request.
Then, when you go online and give it the "fetch" command, it downloads
all the pages you previously asked for, plus all their inline images,
stylesheets, Flash files, and so on.

The main differences between WWWOFFLE and a normal caching proxy server are:
- it caches stuff longer than the HTTP spec says it should, and in
  offline mode, it serves it up when you ask for it (in lieu of an
  error page) even if the HTTP spec says it's stale;
- it remembers requests it couldn't fulfill in order to fulfill them
  later.
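
The semantics of those two differences can be sketched in a few lines
of Python (a toy model of the behavior described above, not anything
resembling WWWOFFLE's actual C code):

```python
import time

class OfflineCache:
    """Toy model: offline, stale entries are served anyway and misses
    are queued; the "fetch" command fulfils the queue later."""

    def __init__(self, max_age=3600):
        self.store = {}      # url -> (timestamp, body)
        self.outgoing = []   # requests to fulfil next time we're online
        self.max_age = max_age
        self.online = False

    def get(self, url, fetch):
        entry = self.store.get(url)
        if self.online:
            if entry and time.time() - entry[0] < self.max_age:
                return entry[1]          # fresh: normal proxy behavior
            body = fetch(url)
            self.store[url] = (time.time(), body)
            return body
        if entry:
            return entry[1]              # stale beats an error page
        if url not in self.outgoing:
            self.outgoing.append(url)    # remember the request
        return "placeholder: %s queued for fetching" % url

    def fetch_outgoing(self, fetch):
        """The "fetch" command: download everything asked for offline."""
        while self.outgoing:
            url = self.outgoing.pop(0)
            self.store[url] = (time.time(), fetch(url))
```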

It also has a fairly kickass recursive-get mode.

It actually allows me to use e.g. Google Maps offline!  Hooray for
REST!

Some features I wish it had/bugs that annoy me:
- being responsive.  Actually, I can't tell if the long spinners when
  Firefox gets pages from it are Firefox's fault or WWWOFFLE's, but I
  suspect the latter;
- noticing DNS server changes, which is apparently impossible to do
  with the standard resolver library in a portable fashion, other than
  by restarting the server;
- remembering which pages I'd actually asked for, and which ones I was
  "done with", which clearly requires some better UI;
- implicitly recursively getting later pages of multi-page articles
  (e.g. on Wired and OSNews; clearly this has to have per-site regexes
  and crap);
- storing annotations on the pages so I could search by annotations
  (it has support for searching your offline cache of part of the
  web);
- a standard blocklist of ad providers, since fetching ads ate up a
  substantial fraction of the hazardous, expensive, battery-draining
  time I spent online;
- better integration into the browser UI --- I'd like to have buttons
  for "recursively fetch from this page" (which would take me to an
  options screen, rather than fetch immediately), "block this URL"
  (likewise, since I'd need to be able to wildcard the URL), "pages
  that linked to this page", and "already-stored pages linked from
  this page" --- which should also show up in link colors.
- while you can see the list of pending fetches
  (http://localhost:8080/index/outgoing/?sort=alpha;all), if you see
  something bad in the list (hundreds of requests for things like
  http://ad.doubleclick.net/adj/ttm.osnews/ros;tile=2;sz=120x600;ord=08028075),
  it's too hard to remove it and blacklist it.  The remove button is
  hidden by default, and takes you to a different page when you use
  it.  Blacklisting is also available, but likewise hidden by default
  and fairly painful to use: click "Config", change Path to "Any Path"
  (two clicks), change Arguments to "Any or none" (two clicks), click
  "Change URL-SPECIFICATION", page down five times to dont-request,
  click it, click the "Yes" radio button, click "make change", close
  window --- 15 UI actions to add each wildcard.  It doesn't even
  remove the URL from the list of pending fetches, and all that is
  once you're already looking at that list rather than the page with
  the offending content.  The list does at least tell you who the
  worst offenders are, but it doesn't show where they're linked from
  or what other similar pages there were (although that's available
  through a few more clicks), so it's hard to distinguish ads from
  Google Maps images (e.g. kh1.google.com).
- a less ugly UI
- would it be so hard to use <label> elements in the UI?
- better handling of temporary failures, such as timeouts.  "Better
  handling" means "retry".
- also it would be nice if I could upload my configuration and list of
  desired URLs to a server somewhere else (like my colo) which would
  do the fetches for me while I was offline, and then download a big
  blob of compressed web pages later.  Ideally I could put the
  configuration and URLs in an encrypted file that I could take to an
  internet cafe and upload, and download the encrypted blob of
  compressed web pages in the same way.
- when I used Squid for offline web browsing, I added meta http-equiv
  refreshes (typically refreshing about once every half hour) to all
  of its error pages, so that any tab displaying an error page would
  eventually display the web page.
- URL blocking that actually worked would be nice too.  I added a
  bunch of domains to the dont-fetch list, but WWWOFFLE kept fetching
  stuff from them anyway.
- hey, it would be nice if I could see why a particular page wasn't
  fetched after the first time I try to GET it.  E.g. if it's an
  image, the first GET might be as an inline image --- I've seen this
  happen sometimes.
- there's an option to list the "pages" that were requested the last
  time I was offline.  As I said before, it would be nice to have a
  better display that distinguished pages I actually loaded
  interactively, by clicking on a link, from inline resources like
  <iframe>, <script>, <img>, <embed>, and <style> links (which might
  be difficult without some more browser integration); it would also
  be nice to be able to see the titles of the pages.  As it is, it's
  difficult to find the dozens of web pages I wanted to read in among
  the thousands of Google Maps images.
- while wwwoffle does nicely store things on disk in separate files,
  the content of the HTTP response isn't stored in a separate file ---
  so you can't use gthumb, gv, file, gzip, and so on, on the files
  from wwwoffle's cache dirs --- you have to access them through
  wwwoffle, which means through HTTP.  That's a little inconvenient.
- At first you might wonder why "sort by file type" doesn't display
  the content-type it's sorting by.  Then you realize that it's
  actually sorting by the file extension, and is therefore worse than
  useless:
  e.g.
  http://www.folklore.org/StoryView.py?project=Macintosh&story=90_Hours_A_Week_And_Loving_It.txt
  (which is HTML!) ends up next to
  http://gnosis.cx/download/gnosis/util/convert/curses_txt2html.py
- It's fairly opaque about how it chooses to purge things from cache.
  There's a "purge" section in the config file, but from my reading of
  it, it shouldn't delete anything until I haven't read it for four
  weeks.  But in fact it's already deleted a bunch of stuff.
- And it would be nice if it remembered more than one version of each
  page.
- strace says it forks a new thread (using clone) for every child
  request.  Perhaps this explains why it is so slow.
- And sometimes it tries to send me an error page with
  "Transfer-Encoding: chunked,chunked" and two layers of chunking,
  which Firefox doesn't like --- the result is that the error page
  displays with some three-digit hexadecimal numbers at the top.  (I
  think this should have been fixed in WWWOFFLE 2.8b.)
- Every once in a while, especially under heavy load, it serves up the
  wrong representation --- I got PNGs and GIF89as for a bunch of my
  HTML pages today.
- You can push it on or offline with an HTTP GET.
- I caught it doing continuous DNS lookups on www.meetomatic.com,
  which had an A record, for many minutes --- several requests per
  second for perhaps half an hour.  This wouldn't be so bad except
  that it did lots of AAAA requests, which failed and thus weren't
  cached.
- Its connect timeouts seem to be painfully short, on the order of
  tens of seconds.
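
The "chunked,chunked" problem above is easy to reproduce in miniature.
Below, a body is chunk-encoded twice (single-chunk case only, for
brevity); a client that strips one layer, as the browser does, is left
with the inner chunk-size line (a three-digit hexadecimal number once
the payload is a few hundred bytes) at the top of the page:

```python
def chunk_encode(body: bytes) -> bytes:
    """One layer of HTTP/1.1 chunked coding (single chunk, for brevity)."""
    return b"%x\r\n%s\r\n0\r\n\r\n" % (len(body), body)

def chunk_decode(data: bytes) -> bytes:
    """Strip one layer of chunked coding (single-chunk case only)."""
    size_line, rest = data.split(b"\r\n", 1)
    return rest[:int(size_line, 16)]

page = b"<html>" + b"x" * 300 + b"</html>"   # 313 = 0x139 bytes
double = chunk_encode(chunk_encode(page))    # "chunked,chunked"
seen = chunk_decode(double)                  # the browser strips one layer
# seen still begins with the inner chunk-size line, b"139\r\n..."
```

(The HTTP spec says chunked must be applied at most once to a message
body, so a conforming client has no reason to strip two layers.)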
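
The Squid trick mentioned above (error pages that re-request
themselves) needs nothing more than a meta refresh; a minimal sketch
of such an error-page generator, with hypothetical wording:

```python
def offline_error_page(url, retry_seconds=1800):
    """Error page that retries itself about every half hour, so a tab
    left open on it eventually shows the real page once it's cached."""
    return ('<html><head>'
            '<meta http-equiv="refresh" content="%d;url=%s">'
            '<title>Not cached yet</title></head>'
            '<body>%s is not available offline yet; '
            'this page will retry automatically.</body></html>'
            % (retry_seconds, url, url))
```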

As far as I can tell from reading the changelog (NEWS file), none of
the things listed above have been fixed in more recent WWWOFFLEs.

Unfortunately I think a good system for this kind of thing implicitly
encodes some workflow.  At a minimum, each URL is in one of these
states:
- nobody cares
- requested
- currently being fetched
- waiting to be read
- discarded
- archived

The idea is that from "waiting to be read", a page can go into
"discarded" or "archived".  I also think you need some kind of
categorization system for this --- anyway, I do --- such that some of
these states can pertain to a particular category.  Normally pages
should inherit their categories from the page they were linked from,
and you should be able to see and edit the categories for a page in
the browser when you have it open.
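
Those states can be sketched as a small transition table; the only
edges taken from the text are the two out of "waiting to be read",
and the rest are my guesses at sensible defaults:

```python
from enum import Enum

class State(Enum):
    NOBODY_CARES = "nobody cares"
    REQUESTED = "requested"
    FETCHING = "currently being fetched"
    UNREAD = "waiting to be read"
    DISCARDED = "discarded"
    ARCHIVED = "archived"

# From "waiting to be read" a page can only be discarded or archived;
# the other edges (e.g. re-queueing a failed fetch) are guesses.
TRANSITIONS = {
    State.NOBODY_CARES: {State.REQUESTED},
    State.REQUESTED: {State.FETCHING},
    State.FETCHING: {State.UNREAD, State.REQUESTED},
    State.UNREAD: {State.DISCARDED, State.ARCHIVED},
    State.DISCARDED: set(),
    State.ARCHIVED: set(),
}

def advance(state, new_state):
    if new_state not in TRANSITIONS[state]:
        raise ValueError("illegal: %s -> %s" % (state.value, new_state.value))
    return new_state
```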
