On 8/25/05, Roger B. <[EMAIL PROTECTED]> wrote:
> > Mhh. I have not looked into this. But is not every desktop aggregator
> > a robot?
>
> Henry: Depends on who you ask. (See the Newsmonster debates from a
> couple years ago.)
As I am the one who kicked off the Newsmonster debates a couple years ago, I would like to throw in my opinion here. My opinion has not changed, and it is this:

1. If a user gives a feed URL to a program (aggregator, aggregator service, ping service, whatever), the program may request it and re-request it as often as it likes. This is not "robotic" behavior in the robots.txt sense. The program has been given instructions to request a URL, and it does so, perhaps repeatedly. This covers the most common case of a desktop or web-based feed reader or aggregator that reads feeds and nothing else.

2. If a user gives a feed URL to a program *and then the program finds all the URLs in that feed and requests them too*, the program needs to support robots.txt exclusions for all the URLs other than the original URL it was given. This is robotic behavior; it's exactly the same as requesting an HTML page, scraping it for links, and then requesting each of those scraped URLs. Whether the original URL pointed to an HTML document or an XML document is immaterial; they are clearly the same use case.

Programs such as wget may fall into either category, depending on command-line options. The user can request a single resource (category 1), or can instruct wget to recurse through links and effectively mirror a remote site (category 2). Section 9.1 of the wget manual describes its behavior in the case of category 2:

http://www.delorie.com/gnu/docs/wget/wget_41.html

"""
For instance, when you issue:

wget -r http://www.server.com/

First the index of `www.server.com' will be downloaded. If Wget finds that it wants to download more documents from that server, it will request `http://www.server.com/robots.txt' and, if found, use it for further downloads. `robots.txt' is loaded only once per each server.
"""

So wget downloads the URL it was explicitly given, but then if it's going to download any other autodiscovered URLs, it checks robots.txt to make sure that's OK.
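The two-category rule is easy to express in code. Here is a minimal Python sketch using the standard library's robots.txt parser; the function name and user-agent string are hypothetical, and a real client would fetch and cache robots.txt once per host (as wget does) rather than take it as a parameter:

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleAggregator/1.0"  # hypothetical user-agent string

def may_fetch(url, user_supplied_url, robots_txt):
    # Category 1: the exact URL the user handed us is always fair game.
    if url == user_supplied_url:
        return True
    # Category 2: any other URL must pass the server's robots.txt.
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(USER_AGENT, url)
```

The point is that the user-supplied URL never goes through the exclusion check, while every URL the program discovered on its own does.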
Bringing this back to feeds, aggregators can fall into either category (1 or 2, above). At the moment, the vast majority of aggregators fall into category 1. *However*, what Newsmonster did 2 years ago pushed it into category 2 in some cases. It had a per-feed option to prefetch and cache the actual HTML pages linked by excerpt-only feeds. When it fetched the feed, Newsmonster would go out and also fetch the page pointed to by each item's <link> element. This is actually a very useful feature; my only problem with it was that it did not respect robots.txt *when it went outside the original feed URL and fetched other resources*.

Nor is this limited to prefetching HTML pages. The same problem arises with aggregators that automatically download *any* linked content, such as enclosures. The end user gave their aggregator the URL of a feed, so the aggregator may poll that feed from now until the end of time (or 410 Gone, whichever comes first :). But if the aggregator reads that feed and subsequently decides to request resources other than the original feed URL (like .mp3 files), the aggregator should support robots.txt for those other URLs.

(And before you say "but my aggregator is nothing but a podcast client, and the feeds are nothing but links to enclosures, so it's obvious that the publisher wanted me to download them" -- WRONG! The publisher might want that, or they might not. They might publish a few selected files on a high-bandwidth server where anything goes, and other files on a low-bandwidth server where they would prefer that users explicitly click the link to download the file if they really want it. Or they might want some types of clients (like personal desktop aggregators) to download those files and other types of clients (like centralized aggregation services) not to download them. Or someone might set up a malicious feed that intentionally pointed to large files on someone else's server... a kind of platypus DoS attack.
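To make the "resources other than the original feed URL" set concrete, here is a hedged Python sketch (the helper name and the tiny inline RSS fragment are invented for illustration) that collects every URL a feed points at beyond the feed itself. Each of these is a category-2 URL: an aggregator that prefetches them should run every one through a robots.txt check first.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal RSS fragment with one <link> and one enclosure.
RSS = """<rss><channel><item>
  <link>http://example.com/post1.html</link>
  <enclosure url="http://example.com/audio/ep1.mp3" type="audio/mpeg" length="1"/>
</item></channel></rss>"""

def linked_resources(feed_xml):
    # Gather the <link> targets and enclosure URLs -- everything the
    # feed points at beyond the feed document itself. None of these
    # should be auto-fetched without consulting robots.txt.
    root = ET.fromstring(feed_xml)
    urls = [e.text for e in root.iter("link")]
    urls += [e.get("url") for e in root.iter("enclosure")]
    return urls
```

Polling the feed URL itself needs no such check; it is only this derived list that makes the client a robot.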
Or any number of other scenarios. So how do you, as a client, know what to do? robots.txt.)

--
Cheers,
-Mark