David A. Desrosiers wrote:
> I think what will piss off users more is being banned for pounding
> the websites of the content providers.

Agreed.

>
>> Does plucker-build parse robots.txt?
>
> Mine does, I'm not sure if the Python code currently does.

It doesn't seem to do so. Otherwise it wouldn't fetch The Onion. Or is this
configurable?
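
For reference, here is a minimal sketch of what a robots.txt check could look like in Python, using the standard library's urllib.robotparser. This is not the actual plucker-build code; the sample robots.txt content, the "Plucker" user-agent string and the helper name `allowed` are assumptions made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real spider would download it from
# http://<host>/robots.txt before fetching anything else from that host.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

A spider would call `allowed(...)` once per URL and skip anything that comes back False.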

>> Also, should robots.txt parsing be mandatory or on a voluntary
>> basis? If you make this configurable, the first thing people will do
>> is shut it off.
>
> Yes, but it should be there, or at least allow a "Wait 'n' seconds
> between pages fetched from this host" or some such delay factor.

I was thinking of maintaining a simple "registry" (just an XML file on a web
server) that specifies sites/servers where you have to be extra careful.
This way JPluck (or any other parser) can automatically lower the connection
count if it is set too high by the user, or introduce a delay. A parser
fetches this registry periodically to refresh its local copy.
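
To make the idea concrete, here is a rough sketch in Python. No such registry exists yet, so the XML layout, the attribute names (`max-connections`, `delay-seconds`), the host names and both function names are purely hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical registry format; nothing like this has been agreed on.
SAMPLE_REGISTRY = """\
<registry>
  <host name="www.example.com" max-connections="1" delay-seconds="5"/>
  <host name="news.example.org" max-connections="2" delay-seconds="0"/>
</registry>
"""


def load_registry(xml_text):
    """Map each host name to (max connections, minimum delay in seconds)."""
    limits = {}
    for host in ET.fromstring(xml_text).findall("host"):
        limits[host.get("name")] = (
            int(host.get("max-connections")),
            float(host.get("delay-seconds")),
        )
    return limits


def effective_settings(limits, hostname, user_connections, user_delay=0.0):
    """Clamp the user's settings to what the registry allows for hostname."""
    if hostname not in limits:
        return user_connections, user_delay
    max_connections, min_delay = limits[hostname]
    return min(user_connections, max_connections), max(user_delay, min_delay)
```

Hosts that are not listed keep whatever the user configured; listed hosts get the lower connection count and the longer delay.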

>> The response to the GET will tell you whether it's valid.
>
> Bzzt. If you have a 90k index.html that is for a 404 message, you
> fetch 90k of data, then dump it. You're not looking at the response
> code in a fetch, even if it's a 200 for a 404 page. HEAD will tell
> you this, generally, and is *MUCH* faster than doing a GET.

If the spider encounters a 404, the connection is simply dropped and only the
headers are read, not the content section. If the server sends a 90k 404
Not Found page, that is their problem. Anyway, browsers perform requests this
way (with an if-modified-since) as well, so I don't see an issue with this.

> Besides,
> you can do a HEAD and GET in the same socket open sequence.. well,
> you can in Perl anyway, not sure if/how Java/Python handles that.

That is HTTP keep-alive, yes. Also supported by Java.
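
As an illustration (in Python, not the Perl or Java code discussed here), a HEAD followed by a GET over one connection could look roughly like the sketch below, using the standard http.client module. The function name `head_then_get` is made up for this example. The socket is only reused if the server permits keep-alive; otherwise http.client quietly opens a new one for the second request.

```python
import http.client


def head_then_get(host, path, port=80):
    """Issue HEAD, then GET on the same connection if HEAD returned 200.

    Returns (status, body); body is None when the GET was skipped.
    """
    conn = http.client.HTTPConnection(host, port, timeout=10)
    try:
        conn.request("HEAD", path)
        head = conn.getresponse()
        head.read()  # no body follows a HEAD, but this completes the response
        if head.status != 200:
            return head.status, None
        # Reuses the open socket when the server allows keep-alive.
        conn.request("GET", path)
        response = conn.getresponse()
        return response.status, response.read()
    finally:
        conn.close()
```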

>> The if-modified-since request header is there to avoid doing a HEAD
>> request.
>
> And 95% or more of websites do not use this header at all (if you
> mean "Last-modified" or the less-commonly-used "If-Modified-Since"),
> so it's not doing you much. I just tried about 15 of the common
> (large) sites I pull content from, and NONE of them use this header.
> The Plucker main website, however does use "Last-Modified". =)

"if-modified-since" is a *request* header sent by the client to indicate the
age of the copy in its cache. And yes, many sites respond to that with a
304 Not Modified. That status code is what you should be looking for. It's
similar to HEAD in that no content follows. This is the way browsers handle
caching; I can see it all in the HTTP headers flashing by here in
Proxomitron, anyway. On the Yahoo homepage, for example, the server keeps
its cool by sending 304s for the images (which rarely change anyway). JPluck
works the same way. Many of my regular sites produce 304s once the
if-modified-since header is present, for HTML content as well, so that
saves bandwidth.
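
For illustration, a conditional GET along these lines could be sketched in Python as follows. This is not JPluck's code (JPluck is Java); the function names are invented for the example, and `cached_at` is assumed to be the Unix timestamp at which the cached copy was fetched.

```python
import email.utils
import urllib.error
import urllib.request


def build_conditional_request(url, cached_at):
    """Build a GET request carrying an If-Modified-Since header."""
    request = urllib.request.Request(url)
    # HTTP dates are always expressed in GMT.
    request.add_header(
        "If-Modified-Since", email.utils.formatdate(cached_at, usegmt=True)
    )
    return request


def fetch_if_modified(url, cached_at):
    """Return the new body, or None if the server answered 304 Not Modified."""
    request = build_conditional_request(url, cached_at)
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:
            return None  # cached copy is still current, nothing was downloaded
        raise
```

urllib reports a 304 as an HTTPError, which is why the "not modified" case is handled in the except branch.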

> Absolutely impossible to tell from dynamic content, and you can't sum
> the page either, since most of them will carry the "Client-Date"
> header in their response, which will stuff up the ability to
> "compare" the page to one stored in a local cache for example.

It depends on how your web application is written. Say you have a database
that you know is refreshed only once a day at 0:00. If a client specifies
that the copy in its cache was retrieved later than that, the application can
respond with a 304. Many web applications don't bother with this, though. It
is hard to do and requires more work in tracking data and modifications on
the back-end.
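
The server-side decision for the daily-refresh case can be sketched in a few lines of Python. This is only an illustration of the logic, not code from any real application; the function name `status_for` is hypothetical, and `last_refresh` is assumed to be a timezone-aware datetime.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime


def status_for(if_modified_since, last_refresh):
    """Pick 304 or 200 from the request header and the last refresh time.

    if_modified_since is the raw header value, or None if it was not sent.
    """
    if if_modified_since is None:
        return 200
    try:
        client_copy = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200  # unparsable date: play safe and send the full page
    if client_copy.tzinfo is None:
        client_copy = client_copy.replace(tzinfo=timezone.utc)
    # A copy taken after the last refresh cannot be stale.
    return 304 if client_copy >= last_refresh else 200
```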

>
>> As an aside: do people think the default user-agent should be
>> AvantGo?
>
> No, because we aren't AvantGo.
>
>> Personally, I don't like to play hide-and-seek, but some people
>> might be concerned that when too many of these "alien" user agents
>> show up in the log of AvantGo sites, webmasters will take better
>> measures to stop non-AvantGo clients from retrieving content.
>
> They already do, see my original reason for including the UserAgent
> and Referer headers and parser options in the Python parser.

So that is actually a reason to default to AvantGo?

>> The PDA version of space.com already has such a protection. They
>> seem to scan the IP address, which is protection that cannot be
>> easily foiled.
>
> No, they check the source IP in the actual connection packets
> themselves, if you don't come from a known AvantGo domain or
> netblock, you aren't allowed into that particular part of their page.

They probably check the REMOTE_ADDR of the request, which is supplied by the
server to the application itself.
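
A check of that kind might look like the Python sketch below. I have no idea how space.com actually implements it; the netblock is a reserved documentation range standing in for whatever addresses they really allow, and `client_allowed` is a made-up name. `environ` is assumed to be a CGI/WSGI-style dictionary of request variables.

```python
import ipaddress

# Hypothetical allow-list; 192.0.2.0/24 is a reserved documentation range.
ALLOWED_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]


def client_allowed(environ, networks=ALLOWED_NETWORKS):
    """Return True if the request's REMOTE_ADDR lies in an allowed netblock."""
    try:
        address = ipaddress.ip_address(environ.get("REMOTE_ADDR", ""))
    except ValueError:
        return False  # missing or malformed address
    return any(address in network for network in networks)
```

Since REMOTE_ADDR comes from the TCP connection rather than from a request header, changing the user-agent does nothing against it.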


Regards
-Laurens

_______________________________________________
plucker-list mailing list
[EMAIL PROTECTED]
http://lists.rubberchicken.org/mailman/listinfo/plucker-list
