Re: Don't Aggregrate Me
On 8/26/05, Graham [EMAIL PROTECTED] wrote:
> > (And before you say "but my aggregator is nothing but a podcast
> > client, and the feeds are nothing but links to enclosures, so it's
> > obvious that the publisher wanted me to download them" -- WRONG!
> > The publisher might want that, or they might not ...
>
> So you're saying browsers should check robots.txt before downloading
> images?

It's sad that such an inane dodge would even garner any attention at
all, much less require a response.

http://www.robotstxt.org/wc/faq.html

"What is a WWW robot?

A robot is a program that automatically traverses the Web's hypertext
structure by retrieving a document, and recursively retrieving all
documents that are referenced.

Note that 'recursive' here doesn't limit the definition to any specific
traversal algorithm; even if a robot applies some heuristic to the
selection and order of documents to visit and spaces out requests over
a long space of time, it is still a robot.

Normal Web browsers are not robots, because they are operated by a
human, and don't automatically retrieve referenced documents (other
than inline images).

Web robots are sometimes referred to as Web Wanderers, Web Crawlers, or
Spiders. These names are a bit misleading as they give the impression
the software itself moves between sites like a virus; this is not the
case, a robot simply visits sites by requesting documents from them."

On a more personal note, I would like to thank you for reminding me why
there will never be an Atom Implementor's Guide.
http://diveintomark.org/archives/2004/08/16/specs

--
Cheers,
-Mark
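To make the robots.txt convention concrete: a publisher who does not
want robots (including automated enclosure fetchers) retrieving a media
directory can already express that with a plain robots.txt file. The
host name and paths below are invented for illustration:

    # http://www.example.com/robots.txt  (illustrative paths only)
    # Binds robots, not a human driving a browser.
    User-agent: *
    Disallow: /podcasts/
    Disallow: /private/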
Re: Don't Aggregrate Me
On Monday, August 29, 2005, at 10:12 AM, Mark Pilgrim wrote:
> On 8/26/05, Graham [EMAIL PROTECTED] wrote:
> > > (And before you say "but my aggregator is nothing but a podcast
> > > client, and the feeds are nothing but links to enclosures, so it's
> > > obvious that the publisher wanted me to download them" -- WRONG!
> > > The publisher might want that, or they might not ...
> >
> > So you're saying browsers should check robots.txt before downloading
> > images?
>
> ...
>
> "Normal Web browsers are not robots, because they are operated by a
> human, and don't automatically retrieve referenced documents (other
> than inline images)."

As has been suggested, to inline images, we need to add frame documents,
stylesheets, Java applets, external JavaScript code, objects such as
Flash files, etc., etc., etc.

The question is, with respect to feed readers, do external feed content
(<content src="..." />), enclosures, etc. fall into the same exceptions
category or not? If not, then what's the best mechanism for telling feed
readers whether they can download them automatically--robots.txt,
another file like robots.txt, or something in the XML? I'd prefer
something in the XML. A possibility:

<feed>
    <ext:auto-download target="enclosures" default="false" />
    <ext:auto-download target="content" default="true" />
    ...
    <entry>
        <link rel="enclosure" href="..." ext:auto-download="yes" />
        <content src="..." ext:auto-download="0" />
        ...
Re: Don't Aggregrate Me
On Monday, August 29, 2005, at 10:39 AM, Antone Roundy wrote:
> <ext:auto-download target="enclosures" default="false" />

More robust would be:

<ext:auto-download target=[EMAIL PROTECTED]'enclosure'] default="false" />

...enabling extension elements to be named in @target without requiring
a list of @target values to be maintained anywhere.
Re: Don't Aggregrate Me
* Antone Roundy [EMAIL PROTECTED] [2005-08-29 19:00]:
> More robust would be:
> <ext:auto-download target=[EMAIL PROTECTED]'enclosure'] default="false" />
> ...enabling extension elements to be named in @target without
> requiring a list of @target values to be maintained anywhere.

Is it wise to require either XPath support in consumers or to formulate
a hackneyed XPath subset specifically for this purpose? And what about
namespaced elements? And what about intermediaries which transcribe the
content into a document with different NS prefixes?

I think sticking to just an @ext:auto-download attribute applicable to
single elements is the wise thing to do.

Of course, I wonder if we can’t simply use @xlink:type for the purpose…
(I admit ignorance of the specifics of XLink, so this idea might be
useless.)

Regards,
--
Aristotle Pagaltzis // http://plasmasturm.org/
Re: Don't Aggregrate Me
--On Monday, August 29, 2005 10:39:33 AM -0600 Antone Roundy [EMAIL PROTECTED] wrote:
> As has been suggested, to inline images, we need to add frame
> documents, stylesheets, Java applets, external JavaScript code,
> objects such as Flash files, etc., etc., etc. The question is, with
> respect to feed readers, do external feed content
> (<content src="..." />), enclosures, etc. fall into the same
> exceptions category or not?

Of course a feed reader can read the feed, and anything required to make
it readable. Duh. And all this time, I thought robots.txt was simple.

robots.txt is a polite hint from the publisher that a robot (not a
human) probably should avoid those URLs. Humans can do any stupid thing
they want, and probably will.

The robots.txt spec is silent on what to do with URLs manually added to
a robot. The normal approach is to deny those, with a message that they
are disallowed by robots.txt, and offer some way to override that.

wunder
--
Walter Underwood
Principal Architect
Verity Ultraseek
Re: Don't Aggregrate Me
* Mark Pilgrim [EMAIL PROTECTED] [2005-08-29 18:20]:
> On 8/26/05, Graham [EMAIL PROTECTED] wrote:
> > So you're saying browsers should check robots.txt before
> > downloading images?
>
> It's sad that such an inane dodge would even garner any attention at
> all, much less require a response.

I’m with you on how robots.txt is to be interpreted, but to a point
there is a point to the dodge. F.ex, your example of pointing an
enclosure to a large file on a foreign server in order to perform a DoS
against it is equally practicable by pointing an <img src="..."> to it
from a high-traffic site. The distinction between what’s inline content
and what’s not really is more arbitrary than inherent.

Of course, that’s just splitting hairs, since it doesn’t actually make
a difference to the interpretation. Crawlers generally don’t traverse
img/@src references, and the few that do, such as Google’s and Yahoo’s
image search services, respect robots.txt. Further, aggregation
services do not retrieve images referenced in the content of the feeds
they consume. So why should they retrieve enclosures?

Regards,
--
Aristotle Pagaltzis // http://plasmasturm.org/
Re: Don't Aggregrate Me
Le 05-08-26 à 18:59, Bob Wyman a écrit :
> Karl, Please, accept my apologies for this. I could have sworn we had
> the policy prominently displayed on the site. I know we used to have
> it there. This must have been lost when we did a site redesign last
> November! I'm really surprised that it has taken this long to notice
> that it is gone. I'll see that we get it back up.

Thank you very much for your honest answer. Much appreciated. You see,
educating users is not obvious, it seems ;) No offense, it just shows
that it is not easily accessible information. And there's a need to
educate Services too.

> Point taken. I'll get it fixed. It's a weekend now. Give me a few
> days... I'm not sure, but I think it makes sense to put this on the
> add-feed page at: http://www.pubsub.com/add_feed.php . Do you agree?

Yes, I guess a warning here, plus a way of saying to users that they
can change their mind later on, might be useful.

> Yes, forged pings or unauthorized third-party pings are a real issue.
> Unfortunately, the current design of the pinging system gives us
> absolutely no means to determine if a ping is authorized by the
> publisher.

Exact. We will run into identification problems if we go further, with
big privacy issues.

> I argued last year that we should develop a blogging or syndication
> architecture document in much the same way that the TAG documented
> the web architecture and in the way that most decent standards groups
> usually produce some sort of reference architecture document.

Yes, I remember that. I remember you talking about it at the New York
meeting we had in May 2004.

> Some solutions, like requiring that pings be signed, would work from
> a technical point of view, but are probably not practical except in
> some limited cases. (e.g. Signatures may make sense as a way to
> enable Fat Pings from small or personal blog sites.)

The thing is that a unique solution will not be enough. I may want to
be able
- to "authorize" services A, B and C to do things with my content,
- but to forbid services X, Y and Z to use my content.

Right now it's very hard to do that, except if you are a geek and you
can block bots by their IP address and hope that this IP will not
change. That's why I would think that having services respect the
license of contents would be a first step. In a service which
aggregates the news from different sources, some of the sources might
be licensed for commercial use and some others not at all. Flickr has
the start of a very interesting acknowledgement of that somehow.
http://www.flickr.com/creativecommons/

* Maybe services like PubSub, Technorati, Bloglines, etc. should
  display the license in the search results. That would be a first
  step.
* Second step would be to not use the content in a commercial activity
  if it has been marked as such. (data mining, marketing profile, etc.)

--
Karl Dubost - http://www.w3.org/People/karl/
W3C Conformance Manager
*** Be Strict To Be Cool ***
Re: Don't Aggregrate Me
<link rel="enclosure" href="http://www.example.com/enclosure.mp3" x:follow="no" />
<link rel="enclosure" href="http://www.example.com/enclosure.mp3" x:follow="yes" />
<content src="http://www.example.com/enclosure.mp3" x:follow="no" />
<content src="http://www.example.com/enclosure.mp3" x:follow="yes" />

???

- James

A. Pagaltzis wrote:
> * Antone Roundy [EMAIL PROTECTED] [2005-08-29 19:00]:
> > More robust would be:
> > <ext:auto-download target=[EMAIL PROTECTED]'enclosure'] default="false" />
> > ...enabling extension elements to be named in @target without
> > requiring a list of @target values to be maintained anywhere.
>
> Is it wise to require either XPath support in consumers or to
> formulate a hackneyed XPath subset specifically for this purpose? And
> what about namespaced elements? And what about intermediaries which
> transcribe the content into a document with different NS prefixes?
>
> I think sticking to just an @ext:auto-download attribute applicable
> to single elements is the wise thing to do.
>
> Of course, I wonder if we can’t simply use @xlink:type for the
> purpose… (I admit ignorance of the specifics of XLink, so this idea
> might be useless.)
>
> Regards,
Re: Don't Aggregrate Me
On 30/8/05 11:19 AM, James M Snell [EMAIL PROTECTED] wrote:
> <link rel="enclosure" href="http://www.example.com/enclosure.mp3" x:follow="no" />
> <link rel="enclosure" href="http://www.example.com/enclosure.mp3" x:follow="yes" />
> <content src="http://www.example.com/enclosure.mp3" x:follow="no" />
> <content src="http://www.example.com/enclosure.mp3" x:follow="yes" />

Why not an XML version of the HTML robots META tags - so we can also
specify NOINDEX, NOARCHIVE as well as NOFOLLOW?

Someone wrote up "A Robots Processing Instruction for XML Documents"
http://atrus.org/writings/technical/robots_pi/spec-199912__/

That's a PI though, and I have no idea how well supported they are. I'd
prefer a namespaced XML vocabulary.

e.
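For illustration only, a namespaced vocabulary along the lines Eric
suggests might carry the same three HTML robots META values at the feed
level. The element name, prefix, and namespace URI below are invented
for this sketch and are not taken from the atrus.org proposal (which
uses a processing instruction instead):

    <feed xmlns="http://www.w3.org/2005/Atom"
          xmlns:x="http://example.org/ns/feed-robots">
      <!-- hypothetical feed-level defaults mirroring INDEX/FOLLOW/ARCHIVE -->
      <x:robots index="yes" follow="no" archive="no" />
      ...
    </feed>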
Re: Don't Aggregrate Me
Eric Scheid wrote:
> On 30/8/05 11:19 AM, James M Snell [EMAIL PROTECTED] wrote:
> > <link rel="enclosure" href="http://www.example.com/enclosure.mp3" x:follow="no" />
> > <link rel="enclosure" href="http://www.example.com/enclosure.mp3" x:follow="yes" />
> > <content src="http://www.example.com/enclosure.mp3" x:follow="no" />
> > <content src="http://www.example.com/enclosure.mp3" x:follow="yes" />
>
> Why not an XML version of the HTML robots META tags - so we can also
> specify NOINDEX, NOARCHIVE as well as NOFOLLOW?
>
> Someone wrote up "A Robots Processing Instruction for XML Documents"
> http://atrus.org/writings/technical/robots_pi/spec-199912__/
>
> That's a PI though, and I have no idea how well supported they are.
> I'd prefer a namespaced XML vocabulary.
>
> e.

That's kinda where I was going with x:follow="no|yes". An
x:archive="no|yes" would also make some sense but could also be handled
with HTTP caching (e.g. set the referenced content to expire
immediately). x:index="no|yes" doesn't seem to make a lot of sense in
this case. x:follow="no|yes" seems to me to be the only one that makes
a lot of sense.

- James
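For the HTTP-caching alternative James mentions for x:archive, the
server hosting the referenced content could mark its responses as
uncacheable. A sketch of such response headers (values illustrative,
not taken from the thread):

    HTTP/1.1 200 OK
    Content-Type: audio/mpeg
    Cache-Control: no-store, no-cache, must-revalidate
    Expires: 0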
Re: Don't Aggregrate Me
On 30/8/05 12:05 PM, James M Snell [EMAIL PROTECTED] wrote:
> That's kinda where I was going with x:follow="no|yes". An
> x:archive="no|yes" would also make some sense but could also be
> handled with HTTP caching (e.g. set the referenced content to expire
> immediately). x:index="no|yes" doesn't seem to make a lot of sense in
> this case. x:follow="no|yes" seems to me to be the only one that
> makes a lot of sense.

x:index could be used to prevent purely ephemeral dross cluttering up
the uber-aggregators. A feed which gives minute-by-minute weather data,
for example.

robots NOARCHIVE is used by search engines, particularly Google, to
control whether they present a 'cached' page, which seems sensible.

e.
Re: Don't Aggregrate Me
--On August 30, 2005 11:39:04 AM +1000 Eric Scheid [EMAIL PROTECTED] wrote:
> Someone wrote up "A Robots Processing Instruction for XML Documents"
> http://atrus.org/writings/technical/robots_pi/spec-199912__/
>
> That's a PI though, and I have no idea how well supported they are.
> I'd prefer a namespaced XML vocabulary.

That was me. I think it makes perfect sense as a PI. But I think reuse
via namespaces is oversold. For example, we didn't even try to use
Dublin Core tags in Atom.

PI support is required by the XML spec -- must be passed to the
application.

wunder
--
Walter Underwood
Principal Software Architect, Verity
Re: Don't Aggregrate Me
--On August 29, 2005 7:05:09 PM -0700 James M Snell [EMAIL PROTECTED] wrote:
> x:index="no|yes" doesn't seem to make a lot of sense in this case.

It makes just as much sense as it does for HTML files. Maybe it is a
whole group of Atom test cases. Maybe it is a feed of reboot times for
the server.

wunder
--
Walter Underwood
Principal Software Architect, Verity
Re: Don't Aggregrate Me
On 8/29/05, Walter Underwood [EMAIL PROTECTED] wrote:
> That was me. I think it makes perfect sense as a PI. But I think
> reuse via namespaces is oversold. For example, we didn't even try to
> use Dublin Core tags in Atom.

Speak for yourself :)

http://bitworking.org/news/Not_Invented_Here

-joe
--
Joe Gregorio    http://bitworking.org
Re: Don't Aggregrate Me
Walter Underwood wrote:
> --On August 30, 2005 11:39:04 AM +1000 Eric Scheid [EMAIL PROTECTED] wrote:
> > Someone wrote up "A Robots Processing Instruction for XML Documents"
> > http://atrus.org/writings/technical/robots_pi/spec-199912__/
> >
> > That's a PI though, and I have no idea how well supported they are.
> > I'd prefer a namespaced XML vocabulary.
>
> That was me. I think it makes perfect sense as a PI. But I think reuse
> via namespaces is oversold. For example, we didn't even try to use
> Dublin Core tags in Atom.
>
> PI support is required by the XML spec -- must be passed to the
> application.

The challenge here is that there is nothing which requires that PIs be
persisted by the application. In other words, should an aggregator like
pubsub.com preserve PIs in an Atom document when it aggregates entries
on to end consumers? Where should the PI go? If an aggregator pulls in
multiple entries from multiple feeds, what should it do if those feeds
have different nofollow, noindex and noarchive PIs?

Also, is the PI reflective of the document in which it appears, or of
the content that is linked to by the document? E.g., is it the
atom:entry that shouldn't be indexed, or the link href that shouldn't
be indexed, or both? Or does putting the PI at the document level have
a different meaning than putting it at the link level? Etc., etc.

Having x:index="yes|no", x:archive="yes|no", x:follow="yes|no"
attributes on the link and content elements provides a very simple
mechanism that a) fits within the existing defined Atom extensibility
model and b) is unambiguous in its meaning. It also allows us to
include atom:entry elements within SOAP envelopes, which are not
allowed to carry processing instructions.

-1 to using PIs for this. Let's not introduce a third way of extending
Atom... with apologies to Monty Python: "There are TWO ways of
extending Atom: link relations and namespaces... and PIs... There are
THREE ways of extending Atom..."

- James
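A minimal sketch of the attribute approach described above, applied
inside a single entry; the x: prefix, its namespace URI, and all URLs
are placeholders rather than values from the thread:

    <entry xmlns="http://www.w3.org/2005/Atom"
           xmlns:x="http://example.org/ns/robots-ext">
      <id>tag:example.org,2005:entry-1</id>
      <title>Sample entry</title>
      <updated>2005-08-30T00:00:00Z</updated>
      <!-- the hints travel with the references themselves, so they
           survive re-aggregation and inclusion in SOAP envelopes -->
      <link rel="enclosure" type="audio/mpeg"
            href="http://www.example.com/show.mp3"
            x:follow="no" x:archive="no" />
      <content src="http://www.example.com/notes.xhtml"
               type="application/xhtml+xml"
               x:follow="yes" x:index="no" />
    </entry>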
Top 10 and other lists should be entries, not feeds.
I'm sorry, but I can't go on without complaining. Microsoft has
proposed extensions which turn RSS V2.0 feeds into lists, and we've got
folk who are proposing much the same for Atom (i.e. stateful,
incremental or partitioned feeds). I think they are wrong. Feeds aren't
lists and lists aren't feeds.

It seems to me that if you want a "Top 10" list, then you should simply
create an entry that provides your Top 10. Then, insert that entry in
your feed so that the rest of us can read it. If you update the list,
then just replace the entry in your feed. If you create a new list (Top
34?) then insert that in the feed along with the Top 10 list. What is
the problem? Why don't folk see that lists are the stuff of entries,
not feeds? Remember, "It's about the entries, Stupid."

I think the reason we've got this pull to turn feeds into lists is
simply because we don't have a commonly accepted list schema. So, the
idea is to repurpose what we've got. Folk are too scared or tired to
try to get a new thing defined and through the process, so they figure
that they will just overload the definition of something that already
exists. I think that's wrong. If we want lists then we should define
lists and not muck about with Atom. If everyone is too tired to do the
job properly and define a real list as a well-defined schema for
something that can be the payload of a content element, then why not
just use OPML as the list format?

What is a search engine or a matching engine supposed to return as a
result if it finds a match for a user query in an entry that comes from
a list-feed? Should it return the entire feed, or should it return just
the entry/item that contained the stuff in the user's query? What
should an aggregating intermediary like PubSub do when it finds a match
in an element of a list-feed? Is there some way to return an entire
feed without building a feed of feeds? Given that no existing
aggregator supports feeds as entries, how can an intermediary
aggregator/filter return something the client will understand? You
might say that the search/matching engine should only present the
matching entry in its results. But, if you do that, what happens is
that you lose the important semantic data that comes from knowing the
position the matched entry had in the original list-feed. There is no
way to preserve that order-dependence information without private
extensions at present.

I'm sorry, but I simply can't see that it makes sense to encourage folk
to break important rules of Atom by redefining feeds to be lists. If we
want lists, we should define what they look like and put them in
entries. Keep your hands off the feeds. Feeds aren't lists, they are
feeds.

bob wyman
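As an illustration of "lists are the stuff of entries", here is a
minimal sketch of a Top 10 carried as an ordinary Atom entry whose
content is an XHTML ordered list; all identifiers and text are invented
for the example (OPML inside a typed content element would be another
option):

    <entry xmlns="http://www.w3.org/2005/Atom">
      <id>tag:example.org,2005:top-10</id>
      <title>Top 10 for the week of 2005-08-29</title>
      <!-- updating the list means replacing this entry, not
           restructuring the feed -->
      <updated>2005-08-29T12:00:00Z</updated>
      <content type="xhtml">
        <div xmlns="http://www.w3.org/1999/xhtml">
          <ol>
            <li>First item</li>
            <li>Second item</li>
            <!-- ... items 3 through 10, in order ... -->
          </ol>
        </div>
      </content>
    </entry>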