On 8/25/05, Roger B. <[EMAIL PROTECTED]> wrote:
> > Mhh. I have not looked into this. But is not every desktop aggregator
> > a robot?
> 
> Henry: Depends on who you ask. (See the Newsmonster debates from a
> couple years ago.)

As I am the one who kicked off the Newsmonster debates a couple years
ago, I would like to throw in my opinion here.  My opinion has not
changed, and it is this:

1. If a user gives a feed URL to a program (aggregator, aggregator
service, ping service, whatever), the program may request it and
re-request it as often as it likes.  This is not "robotic" behavior in
the robots.txt sense.  The program has been given instructions to
request a URL, and it does so, perhaps repeatedly.  This covers the
most common case of a desktop or web-based feed reader or aggregator
that reads feeds and nothing else.

2. If a user gives a feed URL to a program *and then the program finds
all the URLs in that feed and requests them too*, the program needs to
support robots.txt exclusions for all the URLs other than the original
URL it was given.  This is robotic behavior; it's exactly the same as
requesting an HTML page, scraping it for links, and then requesting
each of those scraped URLs.  The fact that the original URL pointed to
an HTML document or an XML document is immaterial; they are clearly
the same use case.
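
To make the distinction concrete, here is a rough sketch in modern
Python of what a category 2 client ought to do before touching any
URL it discovered on its own.  The robots.txt handling is the
standard library's urllib.robotparser module; the user-agent string
and URLs are just placeholders.

    from urllib.parse import urlparse
    from urllib.robotparser import RobotFileParser

    USER_AGENT = "ExampleAggregator/1.0"   # placeholder client name

    def allowed_to_fetch(url):
        # Ask the target host's robots.txt before requesting a URL
        # that the user did not hand us directly.
        parts = urlparse(url)
        robots_url = "%s://%s/robots.txt" % (parts.scheme, parts.netloc)
        rp = RobotFileParser(robots_url)
        rp.read()                          # fetch and parse robots.txt
        return rp.can_fetch(USER_AGENT, url)

    # The feed URL itself came from the user, so no check is needed
    # there.  Any URL found *inside* the feed gets checked first:
    discovered = "http://example.org/articles/full-text.html"
    if allowed_to_fetch(discovered):
        pass   # OK to request it

If the host has no robots.txt at all, can_fetch() comes back True, so
on sites that don't care, the only cost is the extra request for
robots.txt itself.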

Programs such as wget may fall into either category, depending on
command line options.  The user can request a single resource
(category 1), or can instruct wget to recurse through links and
effectively mirror a remote site (category 2).  Section 9.1 of the
wget manual describes its behavior in the case of category 2:

http://www.delorie.com/gnu/docs/wget/wget_41.html
"""
For instance, when you issue:

wget -r http://www.server.com/

First the index of `www.server.com' will be downloaded. If Wget finds
that it wants to download more documents from that server, it will
request `http://www.server.com/robots.txt' and, if found, use it for
further downloads. `robots.txt' is loaded only once per each server.
"""

So wget downloads the URL it was explicitly given, but then if it's
going to download any other autodiscovered URLs, it checks robots.txt
to make sure that's OK.
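
A client that follows discovered links could mirror that
once-per-server behavior with a small cache.  Another rough sketch,
with the same caveats as above (illustrative names, modern Python
standard library):

    from urllib.parse import urlparse
    from urllib.robotparser import RobotFileParser

    _robots = {}   # host -> parsed robots.txt, loaded once per server

    def can_fetch(user_agent, url):
        parts = urlparse(url)
        if parts.netloc not in _robots:
            robots_url = "%s://%s/robots.txt" % (parts.scheme, parts.netloc)
            rp = RobotFileParser(robots_url)
            rp.read()              # only hit /robots.txt the first time
            _robots[parts.netloc] = rp
        return _robots[parts.netloc].can_fetch(user_agent, url)

(A long-running aggregator would also want to re-fetch robots.txt
every so often instead of caching it forever, but that's a detail.)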

Bringing this back to feeds, aggregators can fall into either category
(1 or 2, above).  At the moment, the vast majority of aggregators fall
into category 1.  *However*, what Newsmonster did 2 years ago pushed
it into category 2 in some cases.  It had a per-feed option to
prefetch and cache the actual HTML pages linked by excerpt-only feeds.
When it fetched the feed, Newsmonster would go out and also fetch the
page pointed to by the item's <link> element.  This is actually a very
useful feature; my only problem with it was that it did not respect
robots.txt *when it went outside the original feed URL and fetched
other resources*.

Nor is this limited to prefetching HTML pages.  The same problem
arises with aggregators that automatically download *any* linked
content, such as enclosures.  The end user gave their aggregator the
URL of a feed, so the aggregator may poll that feed from now until the
end of time (or 410 Gone, whichever comes first :).  But if the
aggregator reads that feed and subsequently decides to request
resources other than the original feed URL (like .mp3 files), the
aggregator should support robots.txt for those other URLs.
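
In code, that check slots in right before the download step.  Here is
a sketch using the feedparser library to read the enclosures; the
feed URL and user-agent string are placeholders, and a real client
would cache robots.txt per host as in the earlier sketch:

    import feedparser
    from urllib.parse import urlparse
    from urllib.robotparser import RobotFileParser

    USER_AGENT = "ExamplePodcastClient/1.0"   # placeholder

    # The feed URL is the one thing the user explicitly gave us.
    feed = feedparser.parse("http://example.org/podcast.xml")

    for entry in feed.entries:
        for enclosure in entry.get("enclosures", []):
            url = enclosure["href"]
            parts = urlparse(url)
            robots_url = "%s://%s/robots.txt" % (parts.scheme, parts.netloc)
            rp = RobotFileParser(robots_url)
            rp.read()
            if rp.can_fetch(USER_AGENT, url):
                pass   # go ahead and download the .mp3 (or whatever)
            # otherwise, leave it for the user to click through themselves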

(And before you say "but my aggregator is nothing but a podcast
client, and the feeds are nothing but links to enclosures, so it's
obvious that the publisher wanted me to download them" -- WRONG!  The
publisher might want that, or they might not.  They might publish a
few selected files on a high-bandwidth server where anything goes, and
other files on a low-bandwidth server where they would prefer that
users explicitly click the link to download the file if they really
want it.  Or they might want some types of clients (like personal
desktop aggregators) to download those files and other types of
clients (like centralized aggregation services) not to download them. 
Or someone might set up a malicious feed that intentionally pointed to
large files on someone else's server... a kind of platypus DoS attack.
Or any number of other scenarios.  So how do you, as a client, know
what to do?  robots.txt.)
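
For what it's worth, robots.txt can already express every one of
those preferences, because its rules are scoped to user-agents.  Here
is a contrived example (the agent names and paths are invented), run
through the standard library parser:

    from urllib.robotparser import RobotFileParser

    # A publisher who is fine with personal desktop clients, wants a
    # (hypothetical) centralized service to stay away entirely, and
    # wants nobody prefetching the files under /big-enclosures/:
    robots_txt = """\
    User-agent: BigAggregationService
    Disallow: /

    User-agent: *
    Disallow: /big-enclosures/
    """

    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())

    print(rp.can_fetch("DesktopReader/1.0",
                       "http://example.org/essays/1.html"))           # True
    print(rp.can_fetch("DesktopReader/1.0",
                       "http://example.org/big-enclosures/a.mp3"))    # False
    print(rp.can_fetch("BigAggregationService",
                       "http://example.org/essays/1.html"))           # False

The parser doesn't care whether the disallowed URLs point to HTML
pages, .mp3 files, or anything else, which is exactly the point.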

-- 
Cheers,
-Mark
