> None yet, but these could be added easily. However, I think it'll piss off
> some users.

        I think what will piss off users more is being banned for pounding
the websites of the content providers. Remember, they are the ones we have
to be nice to in the long run. Without content providers and web content,
Plucker just becomes a glorified (locally-parsing) Doc reader. Not fun.

> For instance, The Onion has a robots.txt that disallows everything but the
> home page. http://mobile.theonion.com/robots.txt.

        Then they must have exclusions for their providers, like AvantGo and
friends, no? We should allow tuning of the fetching. Of course, this is the
"paradox pyramid", as they say in the security industry. Users want the
content as fast as possible, and getting it that fast means pounding the
site for it. Content providers want to stay up and be able to respond to
requests for content in a reasonable amount of time for all users, so they
throttle it. If you delay the fetching (slow down the time it takes to get
content), you make the content providers happy and piss off the users. If
you remove the delays in fetching, you make the users happy, to a point, and
piss off the content providers, who will just lock us out, either by IP or
by UserAgent.

> Does plucker-build parse robots.txt?

        Mine does; I'm not sure whether the Python code currently does.
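
        For the Python side, the standard library already ships a
robots.txt parser, so adding this would be cheap. A minimal sketch with
made-up rules (a real crawler would fetch http://host/robots.txt first):

```python
import urllib.robotparser

# Parse a robots.txt body directly; the rules below are invented
# purely for illustration.
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Allow: /index.html",
    "Disallow: /",
])

print(rp.can_fetch("Plucker", "/index.html"))   # True  -- allowed
print(rp.can_fetch("Plucker", "/news/today"))   # False -- blocked
```

The parser applies rules in order, first match wins, which matches the
Onion example above: home page allowed, everything else disallowed.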

> Also, should robots.txt parsing be mandatory or on a voluntary basis? If
> you make this configurable, the first thing people will do is shut it off.

        Yes, but it should be there, or at least allow a "Wait 'n' seconds
between pages fetched from this host" or some such delay factor.
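
        A delay factor like that is only a few lines; here is a sketch of
a per-host throttle (class name and delay value are my own invention):

```python
import time
from urllib.parse import urlparse

class HostThrottle:
    """Sleep so that at least `delay` seconds pass between any two
    fetches to the same host; different hosts are never delayed."""

    def __init__(self, delay):
        self.delay = delay
        self.last = {}          # host -> monotonic time of last fetch

    def wait(self, url):
        host = urlparse(url).netloc
        prev = self.last.get(host)
        if prev is not None:
            remaining = self.delay - (time.monotonic() - prev)
            if remaining > 0:
                time.sleep(remaining)
        self.last[host] = time.monotonic()
```

The fetch loop would just call `throttle.wait(url)` immediately before
each request, which spreads the load without slowing down fetches that
go to different hosts.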

> The response to the GET will tell you whether it's valid.

        Bzzt. If the server hands back a 90k index.html as its 404 page,
you fetch 90k of data, then dump it. The response code alone won't save you
in a fetch, since many servers return a 200 even for their 404 page. HEAD
will tell you this, generally, and is *MUCH* faster than doing a GET.
Besides, you can do a HEAD and GET in the same socket open sequence.. well,
you can in Perl anyway; not sure if/how Java/Python handles that.
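
        For the record, Python can do the same thing: http.client keeps
the socket open across requests under HTTP/1.1 keep-alive. A sketch
(function name is mine; the hostname in the comment is a placeholder):

```python
def fetch_if_ok(conn, path, user_agent="Plucker"):
    """Issue a HEAD first and only pay for the body with a GET when the
    HEAD comes back 200. With HTTP/1.1 keep-alive both requests ride
    the same open socket, so there is no extra connection setup."""
    conn.request("HEAD", path, headers={"User-Agent": user_agent})
    head = conn.getresponse()
    head.read()                  # drain so the connection can be reused
    if head.status != 200:
        return None              # the fat 404 page is never downloaded
    conn.request("GET", path, headers={"User-Agent": user_agent})
    return conn.getresponse().read()

# Real use (hostname is a placeholder):
#     from http.client import HTTPConnection
#     conn = HTTPConnection("example.com")
#     body = fetch_if_ok(conn, "/index.html")
```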

> Performing an extra HEAD only leads to more traffic (if only marginal).

        Extra, yes; the point is that you reduce the number of GET requests
by performing HEAD requests instead, and only issuing a GET for the HEADs
which return 200.

> The if-modified-since request header is there to avoid doing a HEAD
> request.

        And 95% or more of websites do not use this header at all (whether
you mean "Last-Modified" or the less-commonly-used "If-Modified-Since"), so
it's not doing you much good. I just tried about 15 of the common (large)
sites I pull content from, and NONE of them use this header. The Plucker
main website, however, does use "Last-Modified". =)

        Wired, Slashdot, Yahoo, Onion, etc. all do not support this header.
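
        For the minority of servers that do honor it, a conditional GET is
cheap to build; the server answers 304 Not Modified with no body when
nothing changed. A sketch of just the header construction (helper name is
mine; the fetch itself is omitted):

```python
from email.utils import formatdate

def conditional_headers(cached_mtime):
    """Headers for a conditional GET: ask the server to send the body
    only if the page changed after `cached_mtime`, a Unix timestamp,
    e.g. the time we last pulled the page into the local cache."""
    return {"If-Modified-Since": formatdate(cached_mtime, usegmt=True)}

print(conditional_headers(0))
# {'If-Modified-Since': 'Thu, 01 Jan 1970 00:00:00 GMT'}
```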

> This makes sense, it's often hard to say when exactly a resource was last
> modified but it's easier to say that it hasn't been modified since a
> particular date.

        Absolutely impossible to tell with dynamic content, and you can't
checksum the page either, since most of them will carry the "Client-Date"
header in their response, which will stuff up the ability to "compare" the
page to one stored in a local cache, for example.

> As an aside: do people think the default user-agent should be AvantGo?

        No, because we aren't AvantGo.

> Personally, I don't like to play hide-and-seek, but some people might be
> concerned that when too much of these "alien" user agents show up in the
> log of AvantGo sites, webmasters will take better measures to stop
> non-AvantGo clients from retrieving content.

        They already do, see my original reason for including the UserAgent
and Referer headers and parser options in the Python parser.

> The PDA version of space.com already has such a protection. They seem to
> scan the IP address, which is protection that cannot be easily foiled.

        No, they check the source IP in the actual connection packets
themselves; if you don't come from a known AvantGo domain or netblock, you
aren't allowed into that particular part of their page.

        The scanning is getting more aggressive, and likely won't stop for
quite some time. The way to get around it is to talk to the webmasters, let
them know what you're doing, how you're using their content, and why your
use of it doesn't hurt them or infringe on their ability to make money from
it (banner ads, etc.). We're here to help, not hurt, and they need to
understand that.


d.


_______________________________________________
plucker-list mailing list
[EMAIL PROTECTED]
http://lists.rubberchicken.org/mailman/listinfo/plucker-list