I've been banging on the Python code again, trying to make sense of
the currently broken (in my opinion, no flamewar please) --stayonhost and
--staybelow options. I assert that --stayonhost should stay on a particular
FQDN, which it currently does, but that behavior leaves a gaping hole in
the implementation.

        A perfect example is http://slashdot.org/palm. The image that shows
up on that main page comes from images.slashdot.org. If I want to gather
this site to a depth which allows me to get comments (--maxdepth=3 or
greater), the spider also wanders offsite into other linked sites. That's
bad, since the result gets very large, and we're out of the "prepared for
Palm" format of the site. Right now, I either get the site without comments
(useless) or I include comments and have to manually add all of the offsite
links to exclusionlist.txt. Very ugly.

        --staybelow=slashdot.org also does not work as you would expect, as
it requires an actual URL there. We're getting into the misconception of the
"up" and "down" of URLs (as you all know, there is no such thing as "up" or
"down" in any web content; everything is exactly one hop from everything
else). This, IMHO, should be adjusted to take a domain as an argument, not a
URL (complete with protocol). Using --staybelow=slashdot.org and
--staybelow="http://slashdot.org" gives very different results.
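
        For what it's worth, here is a small sketch of why the two forms
can end up being treated so differently. It uses Python's standard
urllib.parse module, not the spider's own parsing code, and split_argument
is a made-up helper name, so take it as an illustration of the ambiguity
rather than a description of what the parser actually does today.

```python
from urllib.parse import urlparse


def split_argument(value):
    """Return (scheme, host, path) as the standard URL parser sees them.

    Hypothetical helper for illustration only; not part of the spider.
    """
    parts = urlparse(value)
    return parts.scheme, parts.netloc, parts.path


# A bare domain has no scheme, so the parser files it under "path"
# and leaves the host empty.
bare = split_argument("slashdot.org")

# With the protocol present, the same text lands in the host field.
full = split_argument("http://slashdot.org")

print("bare:", bare)
print("full:", full)
```

Any comparison that looks at the host field will see nothing at all in the
bare form, which is one plausible way for the two spellings to diverge.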

        In any case, there's a missing option here (and it always has been
missing): --stayondomain. With --stayondomain=slashdot.org, for example,
images.slashdot.org, www.slashdot.org, banjo.slashdot.org, and slashdot.org
are all kept in scope, and you can "package up" the content so that it never
leaves this domain. I could spider it to a maxdepth of 100, and be assured
that it would never get out of hand and go offsite (yes, the file would be
large, but it would be very self-contained).

        Here's what I propose:

        --stayonhost: Will not ever leave the FQDN you specify in your -H
                      <url> syntax.

        --staybelow: (should take a URI as an argument, not a URL) Will
                     never climb above the supplied URI, which is treated
                     as the parent.

        --stayondomain: Will never leave the network you specify, so that
                     www.foo.com, images.foo.com, and foo.com will all be
                     assumed to be included in the same "pluck". Content
                     from all "member domains" will be included.
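
        To make the three behaviors concrete, here is a rough sketch of the
checks as I picture them. None of this is the existing spider code: the
function names are invented for the example, it leans on the standard
urllib.parse module, and the path-prefix reading of --staybelow is just my
assumption about what "below" ought to mean.

```python
from urllib.parse import urlparse


def host_of(url):
    """Lower-cased hostname of a URL, or '' if the URL has none."""
    return (urlparse(url).hostname or "").lower()


def stays_on_host(start_url, candidate_url):
    """--stayonhost: the candidate must be on exactly the same FQDN."""
    return host_of(candidate_url) == host_of(start_url)


def stays_below(parent_url, candidate_url):
    """--staybelow: same host, and the path is at or under the parent path.

    Assumption: "below" means a path-prefix match on whole path segments.
    """
    if host_of(parent_url) != host_of(candidate_url):
        return False
    parent_path = urlparse(parent_url).path
    cand_path = urlparse(candidate_url).path
    # Add a trailing slash so that /palm does not match /palmtree.
    if not parent_path.endswith("/"):
        parent_path += "/"
    if not cand_path.endswith("/"):
        cand_path += "/"
    return cand_path.startswith(parent_path)


def stays_on_domain(domain, candidate_url):
    """--stayondomain: the host is the domain itself or any subdomain of it."""
    domain = domain.lower().lstrip(".")
    host = host_of(candidate_url)
    # Require a dot before the domain so notslashdot.org is rejected.
    return host == domain or host.endswith("." + domain)


if __name__ == "__main__":
    start = "http://slashdot.org/palm"
    image = "http://images.slashdot.org/title.gif"
    print(stays_on_host(start, image))
    print(stays_on_domain("slashdot.org", image))
```

With these definitions the images.slashdot.org picture fails the
--stayonhost test but passes --stayondomain=slashdot.org, which is exactly
the gap described above.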

        Sound feasible?



/d

