David A. Desrosiers wrote:
> I think what will piss off users more is being banned for pounding
> the websites of the content providers.

Agreed.

>> Does plucker-build parse robots.txt?
>
> Mine does, I'm not sure if the Python code currently does.

It doesn't seem to do so. Otherwise it wouldn't fetch The Onion. Or is
this configurable?

>> Also, should robots.txt parsing be mandatory or on a voluntary
>> basis? If you make this configurable, the first thing people will do
>> is shut it off.
>
> Yes, but it should be there, or at least allow a "Wait 'n' seconds
> between pages fetched from this host" or some such delay factor.

I was thinking of maintaining a simple "registry" (just an XML file on a
web server) that specifies sites/servers where you have to be extra
careful. This way JPluck (or any other parser) can automatically lower
the connection count if it is set too high by the user, or introduce a
delay. A parser fetches this registry periodically to refresh its local
copy.

>> The response to the GET will tell you whether it's valid.
>
> Bzzt. If you have a 90k index.html that is for a 404 message, you
> fetch 90k of data, then dump it. You're not looking at the response
> code in a fetch, even if it's a 200 for a 404 page. HEAD will tell
> you this, generally, and is *MUCH* faster than doing a GET.

If the spider encounters a 404 the connection is simply broken and the
content section is not read, only the headers. If the server sends a 90k
404 not found page that is their problem. Anyway, browsers perform
requests this way (with an if-modified-since) as well, so I don't see an
issue with this.

> Besides, you can do a HEAD and GET in the same socket open sequence..
> well, you can in Perl anyway, not sure if/how Java/Python handles that.

That is HTTP keep-alive, yes. Also supported by Java.

>> The if-modified-since request header is there to avoid doing a HEAD
>> request.
>
> And 95% or more of websites do not use this header at all (if you
> mean "Last-modified" or the less-commonly-used "If-Modified-Since"),
> so it's not doing you much.
> I just tried about 15 of the common (large) sites I pull content
> from, and NONE of them use this header. The Plucker main website,
> however does use "Last-Modified". =)

"if-modified-since" is a *request* header sent by the client to indicate
the age of the copy in its cache. And yes, many sites respond to that
with a 304 Not Modified. That status code is what you should be looking
for. It's similar to HEAD in that no content follows. This is the way
browsers handle caching; I can see it all in the HTTP headers flashing
by here in Proxomitron, anyway. Take the Yahoo homepage: the server
keeps its cool by sending 304s for the images (which rarely change
anyway). JPluck works the same way. Many of my regular sites produce
304s once the if-modified-since header is present, also for HTML
content. So that saves bandwidth.

> Absolutely impossible to tell from dynamic content, and you can't sum
> the page either, since most of them will carry the "Client-Date"
> header in their response, which will stuff up the ability to
> "compare" the page to one stored in a local cache for example.

It depends on how your web application is written. Say you have a
database that you know is refreshed only once a day at 0:00. If a client
specifies that the copy in its cache was retrieved later than that, the
application can respond with a 304. Many web applications don't bother
with this, though. It is very hard and requires more work in tracking
data and modifications on the back-end.

>> As an aside: do people think the default user-agent should be
>> AvantGo?
>
> No, because we aren't AvantGo.
>
>> Personally, I don't like to play hide-and-seek, but some people
>> might be concerned that when too many of these "alien" user agents
>> show up in the log of AvantGo sites, webmasters will take better
>> measures to stop non-AvantGo clients from retrieving content.
>
> They already do, see my original reason for including the UserAgent
> and Referer headers and parser options in the Python parser.

So that is actually a reason to default to AvantGo?

>> The PDA version of space.com already has such a protection. They
>> seem to scan the IP address, which is protection that cannot be
>> easily foiled.
>
> No, they check the source IP in the actual connection packets
> themselves, if you don't come from a known AvantGo domain or
> netblock, you aren't allowed into that particular part of their page.

They probably check the REMOTE_ADDR of the request, which is supplied by
the server to the application itself.

Regards
-Laurens

_______________________________________________
plucker-list mailing list
[EMAIL PROTECTED]
http://lists.rubberchicken.org/mailman/listinfo/plucker-list
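
To make the if-modified-since exchange from the thread concrete, here is a minimal Python sketch of a conditional GET. It is not taken from JPluck or from the Plucker Python parser; the function name `fetch_if_modified` and its `opener` parameter are made up for illustration. It relies on the fact that Python's `urllib` raises `HTTPError` for a 304 response.

```python
import email.utils
import urllib.error
import urllib.request


def fetch_if_modified(url, cached_at, opener=urllib.request.urlopen):
    """Return the fresh body, or None when the server answers 304.

    cached_at is the Unix timestamp of the copy already in the local
    cache. opener is a hypothetical hook so the call can be faked in
    tests; by default it is urllib.request.urlopen.
    """
    request = urllib.request.Request(url)
    # If-Modified-Since is a *request* header holding an HTTP date in GMT.
    request.add_header(
        "If-Modified-Since",
        email.utils.formatdate(cached_at, usegmt=True),
    )
    try:
        with opener(request) as response:
            # 200: the page changed, so the body follows and is read.
            return response.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # 304 Not Modified: no body follows, the cached copy is kept.
            return None
        raise  # 404 and friends are passed on to the caller
```

A spider would call something like `fetch_if_modified(url, timestamp_of_cached_copy)` and only re-parse the page when the result is not `None`.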

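
On the robots.txt and "wait 'n' seconds" question from the thread, the following is a rough Python sketch of how a parser could honour both, using the standard library's `urllib.robotparser`. It does not describe what plucker-build or JPluck actually do. The helper name `check_robots`, the user agent string "Plucker" and the one-second fallback delay are assumptions for illustration, and `Crawl-delay` is a non-standard robots.txt extension that not every site uses.

```python
import urllib.robotparser


def check_robots(robots_lines, user_agent, url, default_delay=1.0):
    """Return (allowed, delay_in_seconds) for one URL.

    robots_lines is the already-downloaded robots.txt split into lines.
    default_delay is a hypothetical fallback used when the site does not
    specify a Crawl-delay for this user agent.
    """
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_lines)
    allowed = parser.can_fetch(user_agent, url)
    delay = parser.crawl_delay(user_agent)  # None if not specified
    return allowed, (delay if delay is not None else default_delay)


# Example robots.txt content, invented for this sketch.
rules = [
    "User-agent: *",
    "Crawl-delay: 5",
    "Disallow: /private/",
]
print(check_robots(rules, "Plucker", "http://example.invalid/index.html"))
print(check_robots(rules, "Plucker", "http://example.invalid/private/x"))
```

A spider would sleep for the returned delay between pages fetched from the same host, and skip any URL for which `allowed` is false.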
