> > Mine does, I'm not sure if the Python code currently does.

> It doesn't seem to do so. Otherwise it wouldn't fetch The Onion. Or is
> this configurable?

        Odd, The Onion's robots.txt looks like this:

        User-Agent: *
        Allow: /index.html
        Disallow: /

        That generally disallows spidering beyond /, but not fetching each
link individually, since they DO allow index.html itself. As a test, wget -i
index.html to a depth of 1 will subsequently fetch those links reliably,
without any problems.

> I was thinking of maintaining a simple "registry" (just an XML file on a
> web server) that specifies sites/servers where you have to be extra
> careful.

        Are you suggesting that the parser/distiller fetch this file on
every Pluck? I would consider that a bit of an issue, and I know I
personally wouldn't use that feature, nor, I suspect, would a lot of other
people. It's akin to launching an app and having it send out a little UDP
packet to let the server or author know you're using the software. No
thanks.

        I understand your needs are different, but the design approach needs
to be local, and self-contained, not centralized on a server. Remember, the
design of Plucker keeps the CLIENT in control, not the server.

> This way JPluck (or any other parser) can automatically lower the
> connection count if it is set too high by the user, or introduce a delay.

        Ick. If I set or override a parameter, I EXPECT it to stay
overridden, and not reverted by some tool I can't control upstream.

> A parser fetches this registry periodically to refresh its local copy.

        As an option, defaulted to off, and enabled by the user, perhaps.

> If the spider encounters a 404 the connection is simply broken and the
> content section is not read, only the headers.

        Not with Apache as the server and Linux/Windows clients. I'm not
sure what braindead things IIS or other servers do (except for that
"hiccup" Microsoft added: if your error page is not > 512 bytes, you get
IE's internal error message instead of the remote 404 document), but I know
that a GET request on a page which returns a 404 DOES transfer the entire
length of the document to the client.

> If the server sends a 90k 404 not found page that is their problem.

        ..and a problem for the client. If you hit a page with 100 links,
and 90 of them are 404's, and you spider into them, that is potentially 90k
times 90 links, nearly 8 megs of _useless_ data fetched from the server to
the client. A huge waste.

> Anyway, browsers perform requests this way (with an if-modified-since) as
> well, so I don't see an issue with this.

        Completely useless on dynamic sites, as I've mentioned before. A
dynamically generated page has no stored modification date for the server
to compare against, so it sends the full page every time regardless. This
also increases bandwidth, though to a lesser degree.

> That is HTTP keep-alive, yes. Also supported by Java.

        And not supported by most servers, since it requires HTTP 1.1, but
using it will help on servers which do support it (namely Apache).

        CGI/ASP applications in IIS 5.0 and IIS 4.0 can't use the Keep-Alive
features of HTTP 1.1, no matter what the applications do. You'll notice that
IIS sends a "Connection: close" HTTP header even though the browser
indicates that it wants a Keep-Alive connection and you've enabled
Keep-Alive headers in IIS.

> "if-modified-since" is a *request* header sent by the client to indicate
> the age of the copy in its cache. And yes, many sites respond to that,
> with a 304 Not Modified.

        I'm aware, but see above: sending If-Modified-Since to a site
serving dynamic content is completely moot, and only wastes upstream
bandwidth, the length of that header, on every request. That's all.

> So that is actually a reason to default to AvantGo?

        No, do _NOT_ default it to AvantGo, because as I said, Plucker is
not AvantGo. Besides, we might be getting ourselves into trouble if we claim
to be them in the User-Agent string by default, or ship a tool that does so.
Make it an option to enable "AvantGo heuristics", as I've done in the email
parser.
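        In code, that option amounts to something like this (both
User-Agent strings are made up for illustration; they are not the actual
Plucker or AvantGo identifiers):

```python
def build_request_headers(avantgo_heuristics=False):
    """Spider request headers; impersonation is opt-in, never the default.

    With avantgo_heuristics off (the default), we identify honestly;
    only an explicit opt-in switches to an AvantGo-style User-Agent.
    """
    user_agent = "Plucker-Spider/1.0"            # hypothetical default UA
    if avantgo_heuristics:
        # explicit opt-in: present ourselves as an AvantGo client
        user_agent = "Mozilla/4.0 (compatible; AvantGo 3.2)"
    return {"User-Agent": user_agent}
```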

> They probably check the REMOTE_ADDR of the request, which is supplied by
> the server to the application itself.

        It's more than that. I've spent a considerable amount of time
actually manufacturing TCP packets and watching how they respond (or don't)
when debugging some of the newer AvantGo blocking mechanisms.

        Checking the headers isn't the best approach anyway, as most can be
faked. The best way to get around space.com is not to use them, and let
their readership fall. If they want to limit their audience to those using a
proprietary, bloated, broken client, then they should bear the consequences.

        This discussion is probably better suited to plucker-dev now..


d.


_______________________________________________
plucker-list mailing list
[EMAIL PROTECTED]
http://lists.rubberchicken.org/mailman/listinfo/plucker-list
