text/html with mode=xml in Atom 0.3

2006-03-23 Thread James Holderness


I've been seeing a number of feeds recently using Atom 0.3 with a content 
type of text/html and no mode attribute (i.e. the equivalent of 
mode=xml). However, the markup in that content is wrapped in a CDATA 
section, for example something like this:


   <content type="text/html">
     <![CDATA[<div xmlns="http://www.w3.org/1999/xhtml"><p>Content goes here.</p></div>]]>
   </content>

If it had been marked as escaped you would obviously unescape the CDATA 
before interpreting the markup. However, since the mode is technically 
xml, I was under the impression that it should be treated as inline XML 
and no unescaping was necessary. But that would result in the literal text 
<div xmlns="http://www.w3.org/1999/xhtml"><p>Content goes here.</p></div> 
being displayed to the user, which is obviously not what is intended.


So is this a bug in the content generator (all the feeds I've seen appear to 
be using TypePad) or are you supposed to ignore the mode attribute when the 
content type is set to text/html and always treat it as escaped? I know 
Atom 0.3 is deprecated and I shouldn't be having to deal with this, but the 
reality of the situation is that there are a whole lot of Atom 0.3 feeds 
still out there (probably more than Atom 1.0) and I need to be able to 
support them.


Some feeds where you can see the problem (not all entries though):

http://feeds.feedburner.com/Flickrblog
http://dilbertblog.typepad.com/the_dilbert_blog/atom.xml
http://blog.cymfony.com/atom.xml

Regards
James



Re: atom:name ... text or html?

2006-03-23 Thread Anne van Kesteren


Quoting Eric Scheid [EMAIL PROTECTED]:

> If I have an author with the name Bertrand Café, is it acceptable to put
> that into atom:author like this;
>
>    <author><name><![CDATA[Bertrand Caf&eacute;]]></name></author>
>
> or should I be using the unicode numeric entity instead?

Even if it was HTML you couldn't really use the entity, could you? I think
you have to use a character reference or the actual character instead, yes.


--
Anne van Kesteren
http://annevankesteren.nl/




Re: atom:name ... text or html?

2006-03-23 Thread James M Snell

+1 to what Anne says.  If I received that Atom author name, I would
display it exactly as presented: "Bertrand Caf&eacute;"

- James

Anne van Kesteren wrote:
>
> Quoting Eric Scheid [EMAIL PROTECTED]:
>> If I have an author with the name Bertrand Café, is it acceptable to put
>> that into atom:author like this;
>>
>>    <author><name><![CDATA[Bertrand Caf&eacute;]]></name></author>
>>
>> or should I be using the unicode numeric entity instead?
>
> Even if it was HTML you couldn't really use the entity, could you? I think
> you have to use a character reference or the actual character instead, yes.



Re: atom:name ... text or html?

2006-03-23 Thread James Holderness


Hahaha! It's RSS all over again. In the words of Mark Pilgrim: "Here's 
something that might be HTML. Or maybe not. I can't tell you, and you can't 
guess." :-)


Seriously though, the atom:name element is described as a "human-readable 
name", so unless your name really is "Bertrand Caf&eacute;" that can't be 
right. If RFC4287 had intended to allow markup in the element it would have 
used atomTextConstruct.


Regards
James

Eric Scheid wrote:

> If I have an author with the name Bertrand Café, is it acceptable to put
> that into atom:author like this;
>
>    <author><name><![CDATA[Bertrand Caf&eacute;]]></name></author>




Re: atom:name ... text or html?

2006-03-23 Thread A. Pagaltzis

* Eric Scheid [EMAIL PROTECTED] [2006-03-23 17:30]:
> If I have an author with the name Bertrand Café, is it
> acceptable to put that into atom:author like this;
>
>     <author><name><![CDATA[Bertrand Caf&eacute;]]></name></author>

No. That means the author’s name is "Bertrand Caf&eacute;" (he must
have had very cruel parents), not "Bertrand Café".

> or should I be using the unicode numeric entity instead?

Yes. Or use a literal é as you did in this mail, provided you
emit the feed as UTF-8 (or ISO-8859-1, if you must).

Regards,
-- 
Aristotle Pagaltzis // http://plasmasturm.org/
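
The difference between the three spellings discussed above can be checked with any conforming XML parser. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the helper names `wrap` and `author_name` are made up for this illustration, and the Atom 1.0 namespace is assumed.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"  # Atom 1.0 namespace (assumed here)


def wrap(inner: str) -> str:
    """Hypothetical helper: build an atom:author fragment around a name."""
    return f'<author xmlns="{ATOM_NS}"><name>{inner}</name></author>'


def author_name(author_xml: str) -> str:
    """Parse the fragment and return the text content of atom:name."""
    author = ET.fromstring(author_xml)
    return author.findtext(f"{{{ATOM_NS}}}name")


# Inside CDATA nothing is decoded, so the entity text stays literal.
cdata_form = wrap("<![CDATA[Bertrand Caf&eacute;]]>")
# A numeric character reference is decoded by the XML parser itself.
numeric_form = wrap("Bertrand Caf&#233;")
# The literal character needs no decoding at all.
literal_form = wrap("Bertrand Caf\u00e9")

print(author_name(cdata_form))  # the name still contains "&eacute;"
print(author_name(numeric_form) == author_name(literal_form))
```

The CDATA form hands the consumer the eight characters `&eacute;` as part of the name, while the other two forms yield the same string ending in U+00E9.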



Re: text/html with mode=xml in Atom 0.3

2006-03-23 Thread A. Pagaltzis

* James Holderness [EMAIL PROTECTED] [2006-03-23 17:30]:
> So is this a bug in the content generator (all the feeds I've
> seen appear to be using TypePad)

Yes.

> or are you supposed to ignore the mode attribute when the
> content type is set to text/html and always treat it as
> escaped?

No.

In 0.3, the `mode` attribute was the final arbiter for the form
of the content. In Atom 1.0, its role was subsumed by switching
on the `type` value because consumer developers reported that
this sort of layering was unnecessarily hard to support and
provided no discernible benefit.

Regards,
-- 
Aristotle Pagaltzis // http://plasmasturm.org/
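
As an illustration of "`mode` is the final arbiter", here is a minimal sketch of a strict Atom 0.3 content decoder in Python. It is an assumption-laden example, not code from any aggregator: the function name `decode_content_03` is hypothetical, the return value is taken to be an HTML string ready for display, and base64 payloads are assumed to be UTF-8 text.

```python
import base64
import html
import xml.etree.ElementTree as ET


def decode_content_03(content: ET.Element) -> str:
    """Return displayable HTML for an Atom 0.3 content element.

    The mode attribute alone decides how the payload is read;
    the type attribute is deliberately not consulted here.
    """
    mode = content.get("mode", "xml")  # a missing mode means mode="xml"
    if mode == "xml":
        # Inline XML: text nodes are plain text (so they get escaped for
        # display) and child elements are the markup.
        parts = [html.escape(content.text or "")]
        parts.extend(ET.tostring(child, encoding="unicode") for child in content)
        return "".join(parts)
    if mode == "escaped":
        # The XML parser has already removed the one level of escaping;
        # what remains is the markup itself.
        return content.text or ""
    if mode == "base64":
        # Assumption for this sketch: the decoded bytes are UTF-8 text.
        return base64.b64decode(content.text or "").decode("utf-8")
    raise ValueError(f"unknown mode: {mode!r}")


sample = ET.fromstring(
    '<content type="text/html"><![CDATA[<p>Content goes here.</p>]]></content>'
)
print(decode_content_03(sample))
```

Read strictly, the CDATA-wrapped sample comes out with its angle brackets escaped, so the reader would see the tags as literal text, which is the symptom described at the start of this thread.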



Re: atom:name ... text or html?

2006-03-23 Thread Sylvain Hellegouarch





> Seriously though, the atom:name element is described as a
> human-readable name,

Do you mean that human-readable is equivalent to solely English? 
Because as a French, having accents in names is so natural that I see it 
as human readable too ;)


- Sylvain




Re: atom:name ... text or html?

2006-03-23 Thread James Holderness


Sylvain Hellegouarch wrote:
> Do you mean that human-readable is equivalent to solely English? Because 
> as a French, having accents in names is so natural that I see it as human 
> readable too ;)


No. I mean that the literal sequence of characters "& e a c u t e ;" is not 
human-readable (or at least isn't intended to be).


Regards
James



Re: atom:name ... text or html?

2006-03-23 Thread Stephane Bortzmeyer

On Fri, Mar 24, 2006 at 03:16:18AM +1100,
 Eric Scheid [EMAIL PROTECTED] wrote 
 a message of 10 lines which said:

> or should I be using the unicode numeric entity instead?

Or the character itself, in UTF-8 or any other encoding (but UTF-8 is
the most widely implemented, so you limit the risks).

(That's what I do with http://www.bortzmeyer.org/feed.atom and it
seems OK in every aggregator and it validates.)



Re: atom:name ... text or html?

2006-03-23 Thread David Powell


Thursday, March 23, 2006, 4:57:11 PM, you wrote:

> On 24/3/06 3:21 AM, Anne van Kesteren [EMAIL PROTECTED] wrote:
>
>>> <author><name><![CDATA[Bertrand Caf&eacute;]]></name></author>
>>
>> Even if it was HTML you couldn't really use the entity, could you? I think
>> you have to use a character reference or the actual character instead, yes.
>
> It's true that XML has only a half dozen or so entities defined, meaning
> most interesting entities from html can't exist in XML ... unless maybe they
> are wrapped like in CDATA block like above?

atom:name is not intended to contain HTML; the spec for it doesn't
mention HTML. It is no more correct to put HTML in it than it is to
put a base64'd PDF in there.

> I'm getting the data by scraping an html page, so I'm expecting it to be
> acceptable html code, including html entities.

Your HTML parser should decode the entities for you and return a
string. Your Atom generator should encode or escape the string using
numeric character references.

If you really need to use HTML entities directly, then you could put:

<!DOCTYPE feed [
<!ENTITY eacute "&#233;">
]>

at the top of your feed and get rid of that CDATA. XML processors are
REQUIRED [1] to process internal DTD subsets.

[Hmm, internal DTD subsets completely fail in IE7's feed reader,
throwing up a friendly error message]

[1] http://www.w3.org/TR/2004/REC-xml-20040204/#proc-types

-- 
Dave
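
A quick way to see the internal DTD subset approach working is to feed such a document to a parser that processes internal subsets. The sketch below uses Python's standard `xml.etree.ElementTree`; the feed is a made-up minimal example rather than a complete, valid Atom document, and the Atom 1.0 namespace is assumed.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"  # Atom 1.0 namespace (assumed here)

# The internal subset declares the one HTML entity the feed relies on.
feed_xml = f"""<!DOCTYPE feed [
<!ENTITY eacute "&#233;">
]>
<feed xmlns="{ATOM_NS}">
  <author><name>Bertrand Caf&eacute;</name></author>
</feed>"""

feed = ET.fromstring(feed_xml)
name = feed.findtext(f"{{{ATOM_NS}}}author/{{{ATOM_NS}}}name")

# The parser expands the declared entity, so the name ends in U+00E9.
print(name == "Bertrand Caf\u00e9")
```

Without the `<!ENTITY ...>` declaration the same document is not well-formed, and the parser rejects the undefined `&eacute;` reference.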



Re: atom:name ... text or html?

2006-03-23 Thread Stephane Bortzmeyer

On Thu, Mar 23, 2006 at 05:01:03PM +0100,
 Sylvain Hellegouarch [EMAIL PROTECTED] wrote 
 a message of 11 lines which said:

> Because as a French, having accents in names is so natural that I
> see it as human readable too ;)

As I wrote and used and tested on my blog, there is no problem in Atom
to have a first name with accent like mine. Atom is XML and therefore
Unicode rules.



Re: text/html with mode=xml in Atom 0.3

2006-03-23 Thread James Holderness


A. Pagaltzis wrote:

>> So is this a bug in the content generator (all the feeds I've
>> seen appear to be using TypePad)
>
> Yes.
>
>> or are you supposed to ignore the mode attribute when the
>> content type is set to text/html and always treat it as
>> escaped?
>
> No.


Thanks for the confirmation. I was beginning to think I was wrong. I tested 
this in 15 different aggregators and all but one ignored the mode and 
unescaped the content anyway. I have a horrible feeling I'm going to have to 
add code to emulate this behaviour.


Regards
James
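
One way the lenient behaviour of those aggregators might be emulated is sketched below in Python. This is a guess at a reasonable heuristic, not a description of what any of the tested aggregators actually do; the function name `effective_mode` and the tag-detecting regular expression are assumptions of this example.

```python
import re
import xml.etree.ElementTree as ET

# Rough test for "this text looks like it contains an HTML tag".
_TAG_LIKE = re.compile(r"</?[A-Za-z][^>]*>")


def effective_mode(content: ET.Element) -> str:
    """Return the mode a lenient Atom 0.3 consumer might act on.

    The declared mode is kept unless the content claims to be inline
    XML of type text/html, has no child elements, and its text looks
    like markup; in that case it is treated as escaped.
    """
    declared = content.get("mode", "xml")  # a missing mode means mode="xml"
    if declared != "xml" or content.get("type") != "text/html":
        return declared
    has_child_elements = len(content) > 0
    if not has_child_elements and _TAG_LIKE.search(content.text or ""):
        return "escaped"
    return declared


buggy = ET.fromstring(
    '<content type="text/html"><![CDATA[<div><p>Content goes here.</p></div>]]></content>'
)
print(effective_mode(buggy))
```

Once the XML has been parsed, CDATA-wrapped text and entity-escaped text are indistinguishable, so the heuristic covers both spellings of the same mistake. Genuine inline XML is left alone because it arrives as child elements rather than as text.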



Re: atom:name ... text or html?

2006-03-23 Thread A. Pagaltzis

* Eric Scheid [EMAIL PROTECTED] [2006-03-23 18:05]:
> It's true that XML has only a half dozen or so entities defined,
> meaning most interesting entities from html can't exist in XML
> ... unless maybe they are wrapped like in CDATA block like
> above?

No, a CDATA block simply means that characters like <, > and &
stand for themselves.

> I'm getting the data by scraping an html page, so I'm expecting
> it to be acceptable html code, including html entities.

Then decode the entities to a Unicode string and emit the feed as
Unicode. Simplest thing that will work reliably.

Regards,
-- 
Aristotle Pagaltzis // http://plasmasturm.org/



Re: atom:name ... text or html?

2006-03-23 Thread A. Pagaltzis

* Sylvain Hellegouarch [EMAIL PROTECTED] [2006-03-23 18:15]:
> Do you mean that human-readable is equivalent to solely
> English? Because as a French, having accents in names is so
> natural that I see it as human readable too ;)

Even as a French, you probably write é, not &eacute;. :-)

Regards,
-- 
Aristotle Pagaltzis // http://plasmasturm.org/



Re: atom:name ... text or html?

2006-03-23 Thread Antone Roundy


On Mar 23, 2006, at 9:48 AM, James Holderness wrote:
> Hahaha! It's RSS all over again. In the words of Mark Pilgrim:
> "Here's something that might be HTML. Or maybe not. I can't tell
> you, and you can't guess." :-)
>
> Seriously though, the atom:name element is described as a
> "human-readable name", so unless your name really is "Bertrand
> Caf&eacute;" that can't be right. If RFC4287 had intended to allow
> markup in the element it would have used atomTextConstruct.


I agree with James here--if we had intended for the name to be able  
to include markup, we should have used the construct we created to  
allow that.  This from RFC 4287 (section 3.2):


   element atom:name { text }

would have been this:

   element atom:name { atomTextConstruct }

if we had intended for it to be able to contain anything but literal  
text after XML un-escaping, right?


On Mar 23, 2006, at 9:57 AM, Eric Scheid wrote:
> It's true that XML has only a half dozen or so entities defined, meaning
> most interesting entities from html can't exist in XML ... unless maybe they
> are wrapped like in CDATA block like above?

If they're wrapped in a CDATA block, then they don't trigger an XML  
parsing error, but wrapping something in CDATA isn't a license to  
enter data in a format other than what the RFC allows.


> I'm getting the data by scraping an html page, so I'm expecting it to be
> acceptable html code, including html entities.

You, the producer, are getting the data from an HTML page, so you  
should certainly be prepared to handle HTML entities in it. But you  
the Atom publisher are responsible for making sure that you've made  
any changes to the data that are necessary for it to be proper Atom  
before you publish it. The consumer of the Atom feed doesn't know  
where you got the data, and thus can't be expected to decide how to  
process it based on where you got it.




Re: text/html with mode=xml in Atom 0.3

2006-03-23 Thread A. Pagaltzis

* James Holderness [EMAIL PROTECTED] [2006-03-23 18:40]:
> I tested this in 15 different aggregators and all but one
> ignored the mode and unescaped the content anyway.

Good thing this rule was changed in Atom 1.0, then…

What I really don’t get is what that `xmlns` attribute is doing
there in the CDATA block of your data sample. Sometimes I wonder
if CDATA should not have been left out of the XML spec; it seems
to create far too much confusion to be worthwhile.

Regards,
-- 
Aristotle Pagaltzis // http://plasmasturm.org/



Re: atom:name ... text or html?

2006-03-23 Thread James Holderness


David Powell wrote:

> [Hmm, internal DTD subsets completely fail in IE7's feed reader,
> throwing up a friendly error message]


If I remember correctly they considered that a feature. Something to do with 
DTDs being a security risk. I'm not sure if this also meant they were 
incapable of processing Netscape RSS 0.91 feeds. All I know is that if I 
ever have a blog, I'll be sure to include a DTD at the top of my feed.


Regards
James



Does xml:base apply to type=html content?

2006-03-23 Thread David Powell


xml:base applies to type="xhtml" content, but I'm not sure whether it
is supposed to apply to escaped type="html" content. I reckon that it
does.

Has anybody come across this? Any opinions?

-- 
Dave



Re: text/html with mode=xml in Atom 0.3

2006-03-23 Thread James Holderness


A. Pagaltzis wrote:

> What I really don’t get is what that `xmlns` attribute is doing
> there in the CDATA block of your data sample. Sometimes I wonder
> if CDATA should not have been left out of the XML spec; it seems
> to create far too much confusion to be worthwhile.


Well, if you look at some of those feeds I listed, many of the entries are 
type="application/xhtml+xml" with a namespaced div element as you would 
expect. It looks like they may have taken the exact same code (or template, 
or however it is they do this stuff) and reused it for type="text/html". 
Only with the html they decided they should wrap everything in a CDATA block 
just to be safe.


Regards
James



Re: atom:name ... text or html?

2006-03-23 Thread Tim Bray



On Mar 23, 2006, at 8:01 AM, Sylvain Hellegouarch wrote:






>> Seriously though, the atom:name element is described as a
>> human-readable name,
>
> Do you mean that human-readable is equivalent to solely English?
> Because as a French, having accents in names is so natural that I
> see it as human readable too ;)


You can have accents, you just can't use HTML entities to get them. -Tim



Re: atom:name ... text or html?

2006-03-23 Thread Tim Bray



On Mar 23, 2006, at 8:57 AM, Eric Scheid wrote:



> On 24/3/06 3:21 AM, Anne van Kesteren [EMAIL PROTECTED] wrote:
>
>>> <author><name><![CDATA[Bertrand Caf&eacute;]]></name></author>
>>
>> Even if it was HTML you couldn't really use the entity, could you? I think
>> you have to use a character reference or the actual character instead, yes.
>
> It's true that XML has only a half dozen or so entities defined


To be precise, 5: &lt; &amp; &gt; &apos; &quot; -Tim



Re: atom:name ... text or html?

2006-03-23 Thread Tim Bray


On Mar 23, 2006, at 8:16 AM, Eric Scheid wrote:

> If I have an author with the name Bertrand Café, is it acceptable to put
> that into atom:author like this;
>
> <author><name><![CDATA[Bertrand Caf&eacute;]]></name></author>
>
> or should I be using the unicode numeric entity instead?


The key point is that the atom:name element, described in RFC4287  
3.2.1, is not a Text Construct, as defined in 3.1, so you can't say  
<atom:name type="html">; so no markup allowed.  So just say "Bertrand  
Café".  -Tim





Re: Atom Thread Feed syntax

2006-03-23 Thread James M Snell

Just wanted to follow through on this for everyone.  Given that there
are vendors getting ready to ship code based on the current rev of the
spec, I'm *not* going to rename the "id" attribute to "ref".  Yes, I
know that "id" is confusing to some folks, but we're just talking about the
name of a single attribute and not a critical functional bug.  From this
point forward, only critical spec bugs will be fixed and I will be
submitting the spec for consideration as a standards track RFC in the
not too distant future.

- James

Sylvain Hellegouarch wrote:
>
> Hi everyone,
>
> I was reading the Atom Feed Thread draft [1] yesterday and I ran into a
> problem as I described in my blog [2]. To recap the 'in-reply-to'
> element defined in that specification takes an 'id' attribute that
> specifies /the universally unique identifier of the resource being
> responded to/.
>
> Calling such an attribute 'id' is a mistake in my opinion as it is easily
> confused with the actual ID of the element itself within the XML document
> it belongs to and it makes it impossible for another element within the
> document to have the same value as an 'id'. I would rather move the
> content of that attribute as a text element of the 'in-reply-to' element
> (as does the atom:id element).
>
> Thoughts?
> - Sylvain
>
> [1]
> http://www.ietf.org/internet-drafts/draft-snell-atompub-feed-thread-05.txt
> [2] http://www.defuze.org/archives/2006/03/14/about-atom-feed-threads



Re: atom:name ... text or html?

2006-03-23 Thread Eric Scheid

On 24/3/06 4:42 AM, A. Pagaltzis [EMAIL PROTECTED] wrote:

>> I'm getting the data by scraping an html page, so I'm expecting
>> it to be acceptable html code, including html entities.
>
> Then decode the entities to a Unicode string and emit the feed as
> Unicode. Simplest thing that will work reliably.

I figured as much. Oh well, now to track down a list of html entities and
their corresponding unicodes ...

e.



Re: atom:name ... text or html?

2006-03-23 Thread Tim Bray



On Mar 23, 2006, at 2:20 PM, Eric Scheid wrote:


> Oh well, now to track down a list of html entities and
> their corresponding unicodes ...


http://www.google.com/search?q=xhtml%20entities



Re: atom:name ... text or html?

2006-03-23 Thread A. Pagaltzis

* Eric Scheid [EMAIL PROTECTED] [2006-03-23 23:30]:
> Oh well, now to track down a list of html entities and their
> corresponding unicodes ...

That would be in the spec.
http://www.w3.org/TR/REC-html40/sgml/entities.html

But you shouldn’t have to. Any self-respecting language has a
library for that somewhere.

Regards,
-- 
Aristotle Pagaltzis // http://plasmasturm.org/
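
In Python, for instance, the standard library covers both halves of the job: `html.unescape` decodes HTML entities and `xml.sax.saxutils.escape` escapes the result for XML. The sketch below assumes the scraped value is a plain-text HTML fragment; the function name `atom_name_element` and its `ascii_only` flag are made up for this example.

```python
import html
from xml.sax.saxutils import escape


def atom_name_element(scraped: str, ascii_only: bool = False) -> str:
    """Turn a scraped HTML text fragment into an atom:name element.

    HTML entities are decoded to Unicode first, then the string is
    escaped for XML. With ascii_only=True, non-ASCII characters are
    written as numeric character references instead of literally.
    """
    name = html.unescape(scraped)   # "&eacute;" becomes U+00E9
    xml_text = escape(name)         # escapes &, < and > for XML
    if ascii_only:
        # Must happen after escape(), or the "&" of "&#233;" would be
        # escaped a second time.
        xml_text = xml_text.encode("ascii", "xmlcharrefreplace").decode("ascii")
    return f"<name>{xml_text}</name>"


print(atom_name_element("Bertrand Caf&eacute;", ascii_only=True))
print(atom_name_element("Tom &amp; Jerry"))
```

With `ascii_only=False` the feed has to be served in an encoding that can represent the characters, such as UTF-8; with `ascii_only=True` the output is safe in any ASCII-compatible encoding.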



Re: Atom Thread Feed syntax

2006-03-23 Thread David Powell


Thursday, March 23, 2006, 9:39:09 PM, James M Snell wrote:

> Just wanted to follow through on this for everyone.  Given that there
> are vendors getting ready to ship code based on the current rev of the
> spec, I'm *not* going to rename the "id" attribute to "ref".  Yes, I
> know that "id" is confusing to some folks, but we're just talking the
> name of a single attribute and not a critical functional bug.  From this
> point forward, only critical spec bugs will be fixed and I will be
> submitting the spec for consideration as a standards track RFC in the
> not too distant future.

I'm more bothered about the use of undefined markup on the link
element. I know, I know, I keep going on and on about this, but I keep
seeing more drafts that do the same thing and it isn't just a
theoretical problem: Windows Feed Platform does not preserve arbitrary
markup other than proper extension elements. Other feed stores and
servers are likely to do the same (justifiably IMO).

The abandonment of extension constructs in favour of undefined markup
by this draft, and other draft-*-atompub-* drafts would be an
interoperability concern if these drafts were deployed. If you want to
extend Atom, use Extension Elements.

-- 
Dave



Re: Atom Thread Feed syntax

2006-03-23 Thread A. Pagaltzis

* David Powell [EMAIL PROTECTED] [2006-03-24 02:20]:
> The abandonment of extension constructs in favour of undefined
> markup by this draft, and other draft-*-atompub-* drafts would
> be an interoperability concern if these drafts were deployed. If
> you want to extend Atom, use Extension Elements.

I don’t follow. Please explain how these drafts fail to satisfy
the criteria in Section 6.4.2, Structured Extension Elements.

Regards,
-- 
Aristotle Pagaltzis // http://plasmasturm.org/



Re: Atom Thread Feed syntax

2006-03-23 Thread James M Snell

I believe the concern is over the thr:count and thr:when attributes for
the "replies" link relation, both of which are optional, and both of which
provide what I consider to be extra information.  In other words, it's
ok if an implementation drops them.  The important bit is the
in-reply-to element and the "replies" link rel, both of which fall within
the bounds of the Atom extension model.

- James

A. Pagaltzis wrote:
> * David Powell [EMAIL PROTECTED] [2006-03-24 02:20]:
>> The abandonment of extension constructs in favour of undefined
>> markup by this draft, and other draft-*-atompub-* drafts would
>> be an interoperability concern if these drafts were deployed. If
>> you want to extend Atom, use Extension Elements.
>
> I don’t follow. Please explain how these drafts fail to satisfy
> the criteria in Section 6.4.2, Structured Extension Elements.
>
> Regards,



Re: Atom Thread Feed syntax

2006-03-23 Thread James M Snell


David Powell wrote:
[snip]
> The abandonment of extension constructs in favour of undefined markup
> by this draft, and other draft-*-atompub-* drafts would be an
> interoperability concern if these drafts were deployed. If you want to
> extend Atom, use Extension Elements.

I'm most certainly not abandoning the extension constructs.  One of the
motivations for walking these extension specs through the I-D and
eventually standards-track process is so that they get their own RFC
number.  Implementations that choose to support the extension can point
to RFC4287 *and* RFCwhatever and say, "I support both".  If an
implementation only says "I support RFC4287" and doesn't say anything
about RFCwhatever, it's pretty clear what the result would be.

The most an RFC4287 implementation should be expected to do is adhere to
the defined extension model.  If that implementation also chooses to
support other RFC's that go beyond that extension model, so be it.

That said, the critical parts of the Feed Thread draft (the in-reply-to
element and the replies link rel) follow the guidelines of the Atom
extension model.  That is, any RFC4287 implementation *should* be able
to do something with those elements (even if it's just preserving them).
 The optional parts of the extension (thr:count and thr:when) fall
outside of the Atom extension model.  That's ok.  Implementations can
choose to ignore those things, even completely drop them.

As for the other extension drafts I put out, keep in mind that most
should be considered strictly experimental at this time.  That said,
there is really only one that really falls outside the extension model..
the Link Extensions draft [1]... which, by definition cannot adhere to
the extension model given the fact that Atom link elements are actually
not extensible.

[1]
http://www.ietf.org/internet-drafts/draft-snell-atompub-link-extensions-02.txt

- James