Re: Don't Aggregrate Me

2005-08-29 Thread Mark Pilgrim

On 8/26/05, Graham [EMAIL PROTECTED] wrote:
  (And before you say "but my aggregator is nothing but a podcast
  client, and the feeds are nothing but links to enclosures, so it's
  obvious that the publisher wanted me to download them" -- WRONG!  The
  publisher might want that, or they might not ...
 
 So you're saying browsers should check robots.txt before downloading
 images?

It's sad that such an inane dodge would even garner any attention at
all, much less require a response.

http://www.robotstxt.org/wc/faq.html


What is a WWW robot?
A robot is a program that automatically traverses the Web's hypertext
structure by retrieving a document, and recursively retrieving all
documents that are referenced.

Note that "recursive" here doesn't limit the definition to any
specific traversal algorithm; even if a robot applies some heuristic
to the selection and order of documents to visit and spaces out
requests over a long space of time, it is still a robot.

Normal Web browsers are not robots, because they are operated by a
human, and don't automatically retrieve referenced documents (other
than inline images).

Web robots are sometimes referred to as Web Wanderers, Web Crawlers,
or Spiders. These names are a bit misleading as they give the
impression the software itself moves between sites like a virus; this
is not the case, a robot simply visits sites by requesting documents from
them.


On a more personal note, I would like to thank you for reminding me
why there will never be an Atom Implementor's Guide. 
http://diveintomark.org/archives/2004/08/16/specs

-- 
Cheers,
-Mark



Re: Don't Aggregrate Me

2005-08-29 Thread Antone Roundy


On Monday, August 29, 2005, at 10:12  AM, Mark Pilgrim wrote:

On 8/26/05, Graham [EMAIL PROTECTED] wrote:

(And before you say "but my aggregator is nothing but a podcast
client, and the feeds are nothing but links to enclosures, so it's
obvious that the publisher wanted me to download them" -- WRONG!  The
publisher might want that, or they might not ...


So you're saying browsers should check robots.txt before downloading
images?

...

Normal Web browsers are not robots, because they are operated by a
human, and don't automatically retrieve referenced documents (other
than inline images).


As has been suggested, to inline images, we need to add frame 
documents, stylesheets, Java applets, external JavaScript code, objects 
such as Flash files, etc., etc., etc.  The question is, with respect to 
feed readers, do external feed content (<content src="..." />), 
enclosures, etc. fall into the same exceptions category or not?  If 
not, then what's the best mechanism for telling feed readers whether 
they can download them automatically--robots.txt, another file like 
robots.txt, or something in the XML?  I'd prefer something in the XML.  
A possibility:


<feed>
  <ext:auto-download target="enclosures" default="false" />
  <ext:auto-download target="content" default="true" />
  ...
  <entry>
    <link rel="enclosure" href="..." ext:auto-download="yes" />
    <content src="..." ext:auto-download="0" />
    ...
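For illustration, a minimal consumer-side sketch (Python, standard library
only) of how a feed reader might honor such an extension if it were defined.
The ext namespace URI, the accepted attribute values, and the
fall-back-to-feed-default behavior are assumptions for the sake of the
example, not part of any spec:

import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
EXT = "http://example.org/ns/auto-download"   # hypothetical namespace URI for ext:

def permitted(link_el, feed_defaults):
    """True if the publisher permits fetching this link automatically."""
    value = link_el.get("{%s}auto-download" % EXT)
    if value is not None:
        return value.lower() in ("yes", "true", "1")
    # No per-link attribute: fall back to the feed-level default for enclosures.
    return feed_defaults.get("enclosures", True)

feed = ET.parse("feed.xml").getroot()
defaults = {
    d.get("target"): d.get("default", "true").lower() in ("yes", "true", "1")
    for d in feed.findall("{%s}auto-download" % EXT)
}

for entry in feed.findall("{%s}entry" % ATOM):
    for link in entry.findall("{%s}link[@rel='enclosure']" % ATOM):
        if permitted(link, defaults):
            print("would fetch", link.get("href"))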



Re: Don't Aggregrate Me

2005-08-29 Thread Antone Roundy


On Monday, August 29, 2005, at 10:39  AM, Antone Roundy wrote:

<ext:auto-download target="enclosures" default="false" />

More robust would be:
<ext:auto-download target="link[@rel='enclosure']" default="false" />
...enabling extension elements to be named in @target without requiring 
a list of @target values to be maintained anywhere.
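To make concrete what this would require of consumers, a rough sketch
(Python with lxml; the prefix mapping and file name are assumptions) of
evaluating such a @target expression against a feed -- note that the
consumer has to supply a prefix of its own choosing for the Atom namespace:

from lxml import etree

NSMAP = {"atom": "http://www.w3.org/2005/Atom"}

feed = etree.parse("feed.xml")
# The @target value from the (hypothetical) ext:auto-download element,
# written here with an explicit prefix so the XPath engine can resolve it.
target = "atom:link[@rel='enclosure']"

for entry in feed.xpath("//atom:entry", namespaces=NSMAP):
    for node in entry.xpath(target, namespaces=NSMAP):
        print("auto-download default applies to", node.get("href"))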




Re: Don't Aggregrate Me

2005-08-29 Thread A. Pagaltzis

* Antone Roundy [EMAIL PROTECTED] [2005-08-29 19:00]:
 More robust would be:
  <ext:auto-download target="link[@rel='enclosure']" default="false" />
 ...enabling extension elements to be named in @target without
 requiring a list of @target values to be maintained anywhere.

Is it wise to require either XPath support in consumers or to
formulate a hackneyed XPath subset specifically for this purpose?
And what about namespaced elements? And what about intermediaries
which transcribe the content into a document with different NS
prefixes?

I think sticking to just an @ext:auto-download attribute
applicable to single elements is the wise thing to do.

Of course, I wonder if we can’t simply use @xlink:type for the
purpose… (I admit ignorance of the specifics of XLink, so this
idea might be useless.)

Regards,
-- 
Aristotle Pagaltzis // http://plasmasturm.org/



Re: Don't Aggregrate Me

2005-08-29 Thread Walter Underwood


--On Monday, August 29, 2005 10:39:33 AM -0600 Antone Roundy [EMAIL PROTECTED] wrote:


As has been suggested, to inline images, we need to add frame documents,
stylesheets, Java applets, external JavaScript code, objects such as Flash
files, etc., etc., etc.  The question is, with respect to feed readers, do
external feed content (<content src="..." />), enclosures, etc. fall into
the same exceptions category or not?


Of course a feed reader can read the feed, and anything required
to make it readable. Duh.

And all this time, I thought robots.txt was simple.

robots.txt is a polite hint from the publisher that a robot (not
a human) probably should avoid those URLs. Humans can do any stupid
thing they want, and probably will.

The robots.txt spec is silent on what to do with URLs manually added
to a robot. The normal approach is to deny those, with a message that they
are disallowed by robots.txt, and offer some way to override that.
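A minimal sketch of that convention (Python, standard library; the URLs and
robot name are placeholders): check robots.txt even for a manually added
URL, refuse by default, and leave an explicit override to the operator.

from urllib import robotparser

USER_AGENT = "ExampleFeedBot"   # placeholder robot name

rp = robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()

url = "http://www.example.com/private/feed.xml"   # manually added by a human
if rp.can_fetch(USER_AGENT, url):
    print("fetching", url)
else:
    # Polite default: deny, say why, and let the operator override explicitly.
    print(url, "is disallowed by robots.txt; re-run with an override to fetch anyway")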

wunder
--
Walter Underwood
Principal Architect
Verity Ultraseek



Re: Don't Aggregrate Me

2005-08-29 Thread A. Pagaltzis

* Mark Pilgrim [EMAIL PROTECTED] [2005-08-29 18:20]:
 On 8/26/05, Graham [EMAIL PROTECTED] wrote:
  So you're saying browsers should check robots.txt before
  downloading images?
 
 It's sad that such an inane dodge would even garner any
 attention at all, much less require a response.

I’m with you on how robots.txt is to be interpreted, but up to a
point the dodge does have a point. F.ex, your example of
pointing an enclosure to a large file on a foreign server in
order to perform a DoS against it is equally practicable by
pointing an <img src=...> at it from a high-traffic site.

The distinction between what’s inline content and what’s not
really is more arbitrary than inherent.

Of course, that’s just splitting hairs, since it doesn’t actually
make a difference to the interpretation. Crawlers generally don’t
traverse img/@src references, and the few that do, such as
Google’s and Yahoo’s image search services, respect robots.txt.

Further, aggregation services do not retrieve images referenced
in the content of the feeds they consume. So why should they
retrieve enclosures?

Regards,
-- 
Aristotle Pagaltzis // http://plasmasturm.org/



Re: Don't Aggregrate Me

2005-08-29 Thread Karl Dubost



On 05-08-26 at 18:59, Bob Wyman wrote:

Karl, Please, accept my apologies for this. I could have sworn we
had the policy prominently displayed on the site. I know we used to
have it there. This must have been lost when we did a site redesign
last November! I'm really surprised that it has taken this long to
notice that it is gone.

I'll see that we get it back up.


Thank you very much for your honest answer. Much appreciated.


You see, educating users is not so obvious, it seems ;) No offense, it
just shows that the information is not easily accessible. And
there's a need to educate Services too.


Point taken. I'll get it fixed. It's a weekend now. Give me a few
days... I'm not sure, but I think it makes sense to put this on the
add-feed page at: http://www.pubsub.com/add_feed.php . Do you agree?


Yes, I guess a warning here, plus a way of telling users that they
can change their mind later on, might be useful.



Yes, forged pings or unauthorized third-party pings are a real
issue. Unfortunately, the current design of the pinging system gives us
absolutely no means to determine if a ping is authorized by the
publisher.


Exactly. We will run into identification problems if we go further,
with big privacy issues.


I argued last year that we should develop a blogging or syndication
architecture document in much the same way that the TAG documented the web
architecture and in the way that most decent standards groups usually
produce some sort of reference architecture document.


Yes, I remember that. I remember you talking about it at the New York
meeting we had in May 2004.




Some solutions, like requiring that pings be signed, would work
from a technical point of view, but are probably not practical except in
some limited cases. (e.g. Signatures may make sense as a way to enable
Fat Pings from small or personal blog sites.)


The thing is that a single solution will not be enough.

I may want to be able
- to “authorize” services A, B and C to do things with my content,
- but to forbid services X, Y and Z to use my content.

Right now it's very hard to do that, except if you are a geek and you
can block bots by their IP address, hoping that this IP will not
change. That's why I think that having services respect the
license of content would be a first step.


Consider a service which aggregates news from different sources: some
of the sources might be licensed for commercial use and some others
not at all.  Flickr has the start of a very interesting
acknowledgement of that somehow:


http://www.flickr.com/creativecommons/

* Maybe services like PubSub, Technorati, Bloglines, etc. should
display the license in the search results. That would be a first step.
* A second step would be to not use the content in a commercial
activity if it has been marked as such (data mining, marketing
profiles, etc.).



--
Karl Dubost - http://www.w3.org/People/karl/
W3C Conformance Manager
*** Be Strict To Be Cool ***





Re: Don't Aggregrate Me

2005-08-29 Thread James M Snell


<link rel="enclosure" href="http://www.example.com/enclosure.mp3"
  x:follow="no" />
<link rel="enclosure" href="http://www.example.com/enclosure.mp3"
  x:follow="yes" />

<content src="http://www.example.com/enclosure.mp3" x:follow="no" />
<content src="http://www.example.com/enclosure.mp3" x:follow="yes" />

???

- James

A. Pagaltzis wrote:


* Antone Roundy [EMAIL PROTECTED] [2005-08-29 19:00]:
 


More robust would be:
<ext:auto-download target="link[@rel='enclosure']" default="false" />
...enabling extension elements to be named in @target without
requiring a list of @target values to be maintained anywhere.
   



Is it wise to require either XPath support in consumers or to
formulate a hackneyed XPath subset specifically for this purpose?
And what about namespaced elements? And what about intermediaries
which transcribe the content into a document with different NS
prefixes?

I think sticking to just an @ext:auto-download attribute
applicable to single elements is the wise thing to do.

Of course, I wonder if we can’t simply use @xlink:type for the
purpose… (I admit ignorance of the specifics of XLink, so this
idea might be useless.)

Regards,
 





Re: Don't Aggregrate Me

2005-08-29 Thread Eric Scheid

On 30/8/05 11:19 AM, James M Snell [EMAIL PROTECTED] wrote:

 <link rel="enclosure" href="http://www.example.com/enclosure.mp3"
   x:follow="no" />
 <link rel="enclosure" href="http://www.example.com/enclosure.mp3"
   x:follow="yes" />
 
 <content src="http://www.example.com/enclosure.mp3" x:follow="no" />
 <content src="http://www.example.com/enclosure.mp3" x:follow="yes" />

Why not an XML version of the HTML robots META tags - so we can also specify
NOINDEX, NOARCHIVE as well as NOFOLLOW?

Someone wrote up A Robots Processing Instruction for XML Documents
http://atrus.org/writings/technical/robots_pi/spec-199912__/
That's a PI though, and I have no idea how well supported they are. I'd
prefer a namespaced XML vocabulary.


e.



Re: Don't Aggregrate Me

2005-08-29 Thread James M Snell


Eric Scheid wrote:


On 30/8/05 11:19 AM, James M Snell [EMAIL PROTECTED] wrote:

 


<link rel="enclosure" href="http://www.example.com/enclosure.mp3"
  x:follow="no" />
<link rel="enclosure" href="http://www.example.com/enclosure.mp3"
  x:follow="yes" />

<content src="http://www.example.com/enclosure.mp3" x:follow="no" />
<content src="http://www.example.com/enclosure.mp3" x:follow="yes" />
   



Why not an XML version of the HTML robots META tags - so we can also specify
NOINDEX, NOARCHIVE as well as NOFOLLOW?

Someone wrote up A Robots Processing Instruction for XML Documents
   http://atrus.org/writings/technical/robots_pi/spec-199912__/
That's a PI though, and I have no idea how well supported they are. I'd
prefer a namespaced XML vocabulary.


e.


 

That's kinda where I was going with x:follow=no|yes.  An 
x:archive=no|yes would also make some sense but could also be handled 
with HTTP caching (e.g. set the referenced content to expire 
immediately).  x:index=no|yes doesn't seem to make a lot of sense in 
this case.  x:follow=no|yes seems to me to be the only one that makes 
a lot of sense.
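For the caching variant, a rough sketch of the response headers a publisher
could send so the referenced content is treated as immediately stale
(standard HTTP; whether a given aggregator would honor them for enclosures
is an open question):

# Headers served with the referenced content so well-behaved caches
# treat it as already expired rather than archiving it.
NO_ARCHIVE_HEADERS = {
    "Cache-Control": "no-store, no-cache, max-age=0, must-revalidate",
    "Expires": "0",
    "Pragma": "no-cache",   # for older HTTP/1.0 caches
}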


- James




Re: Don't Aggregrate Me

2005-08-29 Thread Eric Scheid

On 30/8/05 12:05 PM, James M Snell [EMAIL PROTECTED] wrote:

 That's kinda where I was going with x:follow=no|yes.  An
 x:archive=no|yes would also make some sense but could also be handled
 with HTTP caching (e.g. set the referenced content to expire
 immediately).  x:index=no|yes doesn't seem to make a lot of sense in
 this case.  x:follow=no|yes seems to me to be the only one that makes
 a lot of sense.

x:index could be used to prevent purely ephemeral dross cluttering up the
uber-aggregators. A feed which gives minute-by-minute weather data, for
example.

robots NOARCHIVE is used by search engines, particularly Google, to control
whether they present a 'cached' page, which seems sensible.

e. 



Re: Don't Aggregrate Me

2005-08-29 Thread Walter Underwood

--On August 30, 2005 11:39:04 AM +1000 Eric Scheid [EMAIL PROTECTED] wrote:

 Someone wrote up A Robots Processing Instruction for XML Documents
 http://atrus.org/writings/technical/robots_pi/spec-199912__/
 That's a PI though, and I have no idea how well supported they are. I'd
 prefer a namespaced XML vocabulary.

That was me. I think it makes perfect sense as a PI. But I think reuse
via namespaces is oversold. For example, we didn't even try to use
Dublin Core tags in Atom.

PI support is required by the XML spec -- must be passed to the
application.
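For what that looks like in practice, a small sketch (Python with lxml; the
PI name and pseudo-attributes only loosely follow the atrus.org draft and
are assumptions here) of a prolog-level robots PI and how an application
could read it:

from io import BytesIO
from lxml import etree

# A prolog-level robots PI; the pseudo-attribute names are illustrative.
xml = b"""<?xml version="1.0"?>
<?robots index="no" follow="no"?>
<feed xmlns="http://www.w3.org/2005/Atom"/>
"""

doc = etree.parse(BytesIO(xml))
for pi in doc.xpath('//processing-instruction("robots")'):
    print(pi.target, "->", pi.text)   # robots -> index="no" follow="no"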

wunder
--
Walter Underwood
Principal Software Architect, Verity



Re: Don't Aggregrate Me

2005-08-29 Thread Walter Underwood

--On August 29, 2005 7:05:09 PM -0700 James M Snell [EMAIL PROTECTED] wrote:

 x:index=no|yes doesn't seem to make a lot of sense in this case.

It makes just as much sense as it does for HTML files. Maybe it is a
whole group of Atom test cases. Maybe it is a feed of reboot times 
for the server.

wunder
--
Walter Underwood
Principal Software Architect, Verity



Re: Don't Aggregrate Me

2005-08-29 Thread Joe Gregorio

On 8/29/05, Walter Underwood [EMAIL PROTECTED] wrote:
 That was me. I think it makes perfect sense as a PI. But I think reuse
 via namespaces is oversold. For example, we didn't even try to use
 Dublin Core tags in Atom.

Speak for yourself :)
 
 http://bitworking.org/news/Not_Invented_Here

  -joe

-- 
Joe Gregorio    http://bitworking.org



Re: Don't Aggregrate Me

2005-08-29 Thread James M Snell


Walter Underwood wrote:


--On August 30, 2005 11:39:04 AM +1000 Eric Scheid [EMAIL PROTECTED] wrote:
 


Someone wrote up A Robots Processing Instruction for XML Documents
   http://atrus.org/writings/technical/robots_pi/spec-199912__/
That's a PI though, and I have no idea how well supported they are. I'd
prefer a namespaced XML vocabulary.
   



That was me. I think it makes perfect sense as a PI. But I think reuse
via namespaces is oversold. For example, we didn't even try to use
Dublin Core tags in Atom.

PI support is required by the XML spec -- must be passed to the
application.

 

The challenge here is that there is nothing which requires that PIs be 
persisted by the application.  In other words, should an aggregator like 
pubsub.com preserve PIs in an Atom document when it aggregates entries 
on to end consumers?  Where should the PI go?  If an aggregator pulls in 
multiple entries from multiple feeds, what should it do if those feeds 
have different nofollow, noindex and noarchive PIs?  Also, is the PI 
reflective of the document in which it appears or the content that is 
linked to by the document? e.g. is it the atom:entry that shouldn't be 
indexed, or the link href that shouldn't be indexed, or both... or does 
putting the PI at the document level have a different meaning than 
putting it at the link level?  etc etc


Having x:index=yes|no, x:archive=yes|no, x:follow=yes|no 
attributes on the link and content elements provides a very simple 
mechanism that a) fits within the existing defined Atom extensibility 
model and b) is unambiguous in its meaning.  It also allows us to 
include atom:entry elements within SOAP Envelopes, which are not allowed 
to carry processing instructions. 

-1 to using PIs for this.  Let's not introduce a third way of extending 
Atom... with apologies to Monty Python: "There are TWO ways of 
extending Atom... link relations and namespaces... and PIs... There 
are THREE ways of extending Atom..."


- James



Top 10 and other lists should be entries, not feeds.

2005-08-29 Thread Bob Wyman

I'm sorry, but I can't go on without complaining. Microsoft has proposed
extensions which turn RSS V2.0 feeds into "lists" and we've got folk who
are proposing much the same for Atom (i.e. stateful, incremental or
partitioned feeds). I think they are wrong. Feeds aren't lists and lists
aren't feeds. It seems to me that if you want a "Top 10" list, then you
should simply create an entry that provides your Top 10. Then, insert that
entry in your feed so that the rest of us can read it. If you update the
list, then just replace the entry in your feed. If you create a new list
(Top 34?) then insert that in the feed along with the Top 10 list.
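As a rough sketch of that publish-and-replace pattern (Python, standard
library; the id, file name and list content are made up), the replacement
entry keeps the same atom:id and gets a fresh atom:updated, so consumers
treat it as a revision of the existing entry rather than a new one:

import xml.etree.ElementTree as ET
from datetime import datetime, timezone

ATOM = "http://www.w3.org/2005/Atom"
ET.register_namespace("", ATOM)

LIST_ID = "tag:example.org,2005:top-10"   # stable id for the "Top 10" entry

tree = ET.parse("feed.xml")
feed = tree.getroot()

# Drop the previous revision of the list entry, if present.
for entry in feed.findall("{%s}entry" % ATOM):
    if entry.findtext("{%s}id" % ATOM) == LIST_ID:
        feed.remove(entry)

# Append the replacement: same id, new updated timestamp, new content.
entry = ET.SubElement(feed, "{%s}entry" % ATOM)
ET.SubElement(entry, "{%s}id" % ATOM).text = LIST_ID
ET.SubElement(entry, "{%s}title" % ATOM).text = "Top 10"
ET.SubElement(entry, "{%s}updated" % ATOM).text = datetime.now(timezone.utc).isoformat()
content = ET.SubElement(entry, "{%s}content" % ATOM)
content.set("type", "text")
content.text = "1. first item\n2. second item\n..."

tree.write("feed.xml", encoding="utf-8", xml_declaration=True)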

What is the problem? Why don't folk see that lists are the stuff of
entries -- not feeds? Remember, "It's about the entries, Stupid."

I think the reason we've got this pull to turn feeds into lists is simply
because we don't have a commonly accepted "list" schema. So, the idea is
to repurpose what we've got. Folk are too scared or tired to try to get a
new thing defined and through the process, so they figure that they will
just overload the definition of something that already exists. I think
that's wrong. If we want "lists" then we should define lists and not muck
about with Atom. If everyone is too tired to do the job properly and
define a real list as a well defined schema for something that can be the
payload of a content element, then why not just use OPML as the list
format?



What is a search engine or a matching engine supposed to return as a
result if it finds a match for a user query in an entry that comes from a
list-feed? Should it return the entire feed or should it return just the
entry/item that contained the stuff in the user's query? What should an
aggregating intermediary like PubSub do when it finds a match in an
element of a list-feed? Is there some way to return an entire feed without
building a "feed of feeds"? Given that no existing aggregator supports
"feeds as entries", how can an intermediary aggregator/filter return
something the client will understand?

You might say that the search/matching engine should only present the
matching entry in its results. But, if you do that, what happens is that
you lose the important semantic data that comes from knowing the position
the matched entry had in the original list-feed. There is no way to
preserve that order-dependence information without private extensions at
present.

I'm sorry but I simply can't see that it makes sense to encourage folk to
break important rules of Atom by redefining feeds to be "lists". If we
want lists we should define what they look like and put them in entries.
Keep your hands off the feeds. Feeds aren't lists -- they are feeds.



    bob wyman