David A. Desrosiers wrote:
> Rob warned us about this several months ago, they monitor the logs
> at Slashdot VERY closely, and he specifically said to point to the
> version at http://slashd.org/palm version of the site, and NOT to
> pound it.

That is why pushing the 0.9 release is the right thing to do. I did not
foresee that JPluck would take off like this, and the report of an IP ban
spurred me into action. The delay between connections, the HTTP cache, and
the lowered limit on the number of simultaneous connections should be
adequate to address this concern in the future.

> What facilities in JPluck are you using to adhere to robots.txt?

None yet, but support could be added easily. However, I think it'll piss off
some users. For instance, The Onion has a robots.txt that disallows
everything but the home page: http://mobile.theonion.com/robots.txt

Does plucker-build parse robots.txt?

Also, should robots.txt parsing be mandatory or on a voluntary basis? If you
make this configurable, the first thing people will do is shut it off.
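
For reference, this is roughly what a robots.txt check looks like, sketched
in Python with the standard urllib.robotparser module. The rules, the
example.com URLs, the "JPluck" user-agent string, and the honor_robots
switch are assumptions made up for the example; they are not taken from any
real site or from JPluck itself.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules of the kind described above: only the home page
# is allowed, everything else is off limits.
EXAMPLE_RULES = """\
User-agent: *
Allow: /index.html
Disallow: /
"""


def load_rules(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser, url, user_agent="JPluck", honor_robots=True):
    """Return True if url may be fetched.

    honor_robots=False models the "voluntary" setting: the check is
    skipped entirely and every URL is allowed.
    """
    if not honor_robots:
        return True
    return parser.can_fetch(user_agent, url)


rules = load_rules(EXAMPLE_RULES)
home_ok = may_fetch(rules, "http://example.com/index.html")
story_ok = may_fetch(rules, "http://example.com/news/story.html")
```

With rules like these, a spider that honors robots.txt stops at the home
page, which is exactly the situation that would annoy users.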

> Are you also using (as I am in perl) a simultaneous HEAD request to
> see that the page is indeed valid, before making a GET request for it?

The response to the GET will tell you whether it's valid. Performing an
extra HEAD only leads to more traffic (if only marginally). The
If-Modified-Since request header is there to avoid doing a HEAD request.
(Browsers do it this way as well.) Otherwise you have to retrieve the
Last-Modified date using a HEAD, then decide whether to perform a GET based
on that. Also, servers often do not return a Last-Modified date in their
response, but do handle If-Modified-Since. This makes sense: it's often hard
to say exactly when a resource was last modified, but it's easier to say that
it hasn't been modified since a particular date.
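
A conditional GET of this kind can be sketched in Python with the standard
urllib module. The function names and the opener parameter are made up for
the example, and this is not how JPluck itself is implemented. Note that
urllib reports a 304 Not Modified response as an HTTPError, so the sketch
catches it.

```python
import urllib.error
import urllib.request
from datetime import datetime, timezone
from email.utils import format_datetime


def build_conditional_request(url, last_fetched):
    """Return a GET request carrying If-Modified-Since.

    last_fetched must be a timezone-aware UTC datetime.
    """
    req = urllib.request.Request(url)
    # HTTP dates are always expressed in GMT.
    req.add_header("If-Modified-Since",
                   format_datetime(last_fetched, usegmt=True))
    return req


def fetch_if_modified(url, last_fetched, opener=urllib.request.urlopen):
    """Return the body as bytes, or None if the server answers 304.

    A single request both validates the cached copy and, when it is
    stale, delivers the new content. No separate HEAD is needed.
    """
    req = build_conditional_request(url, last_fetched)
    try:
        with opener(req) as resp:
            return resp.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:
            return None  # cached copy is still current
        raise
```

A caller keeps the time of its last successful fetch alongside the cached
page and passes it in as last_fetched on the next run.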

As an aside: do people think the default user-agent should be AvantGo?
Personally, I don't like to play hide-and-seek, but some people might be
concerned that when too many of these "alien" user agents show up in the logs
of AvantGo sites, webmasters will take better measures to stop non-AvantGo
clients from retrieving content. The PDA version of space.com already has
such a protection. They seem to check the IP address, which is a protection
that cannot be easily foiled.


Regards
-Laurens

_______________________________________________
plucker-list mailing list
[EMAIL PROTECTED]
http://lists.rubberchicken.org/mailman/listinfo/plucker-list
