text/html with mode=xml in Atom 0.3
I've been seeing a number of feeds recently using Atom 0.3 with a content type of text/html and no mode attribute (i.e. the equivalent of mode="xml"). However, the markup in that content is wrapped in a CDATA section, for example something like this:

<content type="text/html"><![CDATA[<div xmlns="http://www.w3.org/1999/xhtml"><p>Content goes here.</p></div>]]></content>

If it had been marked as escaped you would obviously unescape the CDATA before interpreting the markup. However, since the mode is technically xml, I was under the impression that it should be treated as inline XML and no unescaping was necessary. But that would result in the literal text

<div xmlns="http://www.w3.org/1999/xhtml"><p>Content goes here.</p></div>

being displayed to the user, which is obviously not what is intended.

So is this a bug in the content generator (all the feeds I've seen appear to be using TypePad) or are you supposed to ignore the mode attribute when the content type is set to text/html and always treat it as escaped?

I know Atom 0.3 is deprecated and I shouldn't be having to deal with this, but the reality of the situation is that there are a whole lot of Atom 0.3 feeds still out there (probably more than Atom 1.0) and I need to be able to support them.

Some feeds where you can see the problem (not all entries though):

http://feeds.feedburner.com/Flickrblog
http://dilbertblog.typepad.com/the_dilbert_blog/atom.xml
http://blog.cymfony.com/atom.xml

Regards
James
Re: atom:name ... text or html?
Quoting Eric Scheid [EMAIL PROTECTED]:

> If I have an author with the name Bertrand Café, is it acceptable to put that into atom:author like this;
>
> <author><name><![CDATA[Bertrand Caf&eacute;]]></name></author>
>
> or should I be using the unicode numeric entity instead?

Even if it was HTML you couldn't really use the entity, could you? I think you have to use a character reference or the actual character instead, yes.

-- Anne van Kesteren http://annevankesteren.nl/
Re: atom:name ... text or html?
+1 to what Anne says. If I received that Atom author name, I would display it exactly as presented: "Bertrand Caf&eacute;"

- James

Anne van Kesteren wrote:

> Quoting Eric Scheid [EMAIL PROTECTED]:
>
>> If I have an author with the name Bertrand Café, is it acceptable to put that into atom:author like this;
>>
>> <author><name><![CDATA[Bertrand Caf&eacute;]]></name></author>
>>
>> or should I be using the unicode numeric entity instead?
>
> Even if it was HTML you couldn't really use the entity, could you? I think you have to use a character reference or the actual character instead, yes.
Re: atom:name ... text or html?
Hahaha! It's RSS all over again. In the words of Mark Pilgrim: "Here's something that might be HTML. Or maybe not. I can't tell you, and you can't guess." :-)

Seriously though, the atom:name element is described as a human-readable name, so unless your name really is "Bertrand Caf&eacute;" that can't be right. If RFC4287 had intended to allow markup in the element it would have used atomTextConstruct.

Regards
James

Eric Scheid wrote:

> If I have an author with the name Bertrand Café, is it acceptable to put that into atom:author like this;
>
> <author><name><![CDATA[Bertrand Caf&eacute;]]></name></author>
Re: atom:name ... text or html?
* Eric Scheid [EMAIL PROTECTED] [2006-03-23 17:30]:

> If I have an author with the name Bertrand Café, is it acceptable to put that into atom:author like this;
>
> <author><name><![CDATA[Bertrand Caf&eacute;]]></name></author>

No. That means the author’s name is "Bertrand Caf&eacute;" (he must have had very cruel parents), not "Bertrand Café".

> or should I be using the unicode numeric entity instead?

Yes. Or use a literal é as you did in this mail, provided you emit the feed as UTF-8 (or ISO-8859-1, if you must).

Regards,
-- Aristotle Pagaltzis // http://plasmasturm.org/
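
The point about CDATA can be checked with any XML parser. What follows is only a minimal sketch using Python's standard xml.etree.ElementTree; the choice of Python and the variable names are assumptions for illustration, not something anyone in the thread posted. It parses the name three ways: inside CDATA, as a numeric character reference, and as the literal character.

```python
import xml.etree.ElementTree as ET

# Inside CDATA the ampersand stands for itself, so no reference is expanded.
cdata_name = ET.fromstring("<name><![CDATA[Bertrand Caf&eacute;]]></name>").text

# A numeric character reference is part of XML itself and is expanded by the parser.
numeric_name = ET.fromstring("<name>Bertrand Caf&#233;</name>").text

# The literal character needs no special treatment at all.
literal_name = ET.fromstring("<name>Bertrand Caf\u00e9</name>").text

print(repr(cdata_name))
print(repr(numeric_name))
print(repr(literal_name))
```

Only the last two produce the same string; the CDATA version keeps the eight characters "&eacute;" verbatim in the name.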
Re: text/html with mode=xml in Atom 0.3
* James Holderness [EMAIL PROTECTED] [2006-03-23 17:30]:

> So is this a bug in the content generator (all the feeds I've seen appear to be using TypePad)

Yes.

> or are you supposed to ignore the mode attribute when the content type is set to text/html and always treat it as escaped?

No. In 0.3, the `mode` attribute was the final arbiter for the form of the content. In Atom 1.0, its role was subsumed by switching on the `type` value because consumer developers reported that this sort of layering was unnecessarily hard to support and provided no discernible benefit.

Regards,
-- Aristotle Pagaltzis // http://plasmasturm.org/
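
As a rough illustration of "the mode attribute is the final arbiter", here is a minimal Python sketch of a strict Atom 0.3 consumer. The function name decode_atom03_content and the (kind, payload) return shape are hypothetical, the sample omits the Atom and XHTML namespaces for brevity, and it assumes the three Atom 0.3 modes (xml, escaped, base64) with xml as the default.

```python
import base64
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def decode_atom03_content(elem):
    """Return (kind, payload) for an Atom 0.3 content element, trusting only `mode`."""
    mode = elem.get("mode", "xml")  # an absent mode attribute means "xml"
    if mode == "escaped":
        # The XML parser has already removed one level of escaping,
        # so the element text is the markup itself.
        return "markup", elem.text or ""
    if mode == "base64":
        return "bytes", base64.b64decode(elem.text or "")
    if mode == "xml":
        # Inline XML: character data is literal text (re-escaped here so it
        # stays literal when rendered); child elements are the markup.
        inner = escape(elem.text or "")
        inner += "".join(ET.tostring(child, encoding="unicode") for child in elem)
        return "markup", inner
    raise ValueError("unknown mode: %r" % mode)

# A simplified version of the problematic sample: no mode, markup inside CDATA.
sample = ET.fromstring(
    '<content type="text/html">'
    "<![CDATA[<div><p>Content goes here.</p></div>]]>"
    "</content>"
)
kind, payload = decode_atom03_content(sample)
print(payload)
```

Under this strict reading the CDATA-wrapped sample comes out as escaped text, so a renderer would show the tags to the user rather than interpret them.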
Re: atom:name ... text or html?
> Seriously though, the atom:name element is described as a human-readable name,

Do you mean that human-readable is equivalent to solely English? Because as a French, having accents in names is so natural that I see it as human readable too ;)

- Sylvain
Re: atom:name ... text or html?
Sylvain Hellegouarch wrote:

> Do you mean that human-readable is equivalent to solely English? Because as a French, having accents in names is so natural that I see it as human readable too ;)

No. I mean that the literal sequence of characters & e a c u t e ; is not human-readable (or at least isn't intended to be).

Regards
James
Re: atom:name ... text or html?
On Fri, Mar 24, 2006 at 03:16:18AM +1100, Eric Scheid [EMAIL PROTECTED] wrote a message of 10 lines which said:

> or should I be using the unicode numeric entity instead?

Or the character itself, in UTF-8 or any other encoding (but UTF-8 is the most widely implemented, so you limit the risks). (That's what I do with http://www.bortzmeyer.org/feed.atom and it seems OK in every aggregator and it validates.)
Re: atom:name ... text or html?
Thursday, March 23, 2006, 4:57:11 PM, you wrote:

> On 24/3/06 3:21 AM, Anne van Kesteren [EMAIL PROTECTED] wrote:
>
>>> <author><name><![CDATA[Bertrand Caf&eacute;]]></name></author>
>>
>> Even if it was HTML you couldn't really use the entity, could you? I think you have to use a character reference or the actual character instead, yes.
>
> It's true that XML has only a half dozen or so entities defined, meaning most interesting entities from html can't exist in XML ... unless maybe they are wrapped in a CDATA block like above?

atom:name is not intended to contain HTML, the spec for it doesn't mention HTML, it is no more correct to put HTML in it, than it is to put base64'd PDF in there.

> I'm getting the data by scraping an html page, so I'm expecting it to be acceptable html code, including html entities.

Your HTML parser should decode the entities for you and return a string. Your Atom generator should encode or escape the string using numeric entities.

If you really need to use HTML entities directly, then you could put:

<!DOCTYPE feed [ <!ENTITY eacute "&#233;"> ]>

at the top of your feed and get rid of that CDATA. XML processors are REQUIRED [1] to process internal DTD subsets.

[Hmm, internal DTD subsets completely fail in IE7's feed reader, throwing up a friendly error message]

[1] http://www.w3.org/TR/2004/REC-xml-20040204/#proc-types

-- Dave
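
A quick way to try the internal DTD subset idea is sketched below with Python's standard xml.etree.ElementTree, which is backed by expat. The feed is a stripped-down stand-in without the Atom namespace, and the helper names author_name and parses are hypothetical.

```python
import xml.etree.ElementTree as ET

# Same document with and without the internal DTD subset declaring eacute.
WITH_DTD = (
    '<!DOCTYPE feed [ <!ENTITY eacute "&#233;"> ]>\n'
    "<feed><author><name>Bertrand Caf&eacute;</name></author></feed>"
)
WITHOUT_DTD = "<feed><author><name>Bertrand Caf&eacute;</name></author></feed>"

def author_name(document):
    """Parse the document and return the text of author/name."""
    return ET.fromstring(document).findtext("author/name")

def parses(document):
    """Report whether the document is accepted by the XML parser."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False

print(repr(author_name(WITH_DTD)))
print(parses(WITHOUT_DTD))
```

With the declaration in place the parser expands the entity to é; without it the undeclared entity makes the document fail to parse, which is the reason the CDATA wrapper seemed attractive in the first place.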
Re: atom:name ... text or html?
On Thu, Mar 23, 2006 at 05:01:03PM +0100, Sylvain Hellegouarch [EMAIL PROTECTED] wrote a message of 11 lines which said:

> Because as a French, having accents in names is so natural that I see it as human readable too ;)

As I wrote, and as I have used and tested on my blog, there is no problem in Atom with having a first name with an accent, like mine. Atom is XML and therefore Unicode rules.
Re: text/html with mode=xml in Atom 0.3
A. Pagaltzis wrote:

>> So is this a bug in the content generator (all the feeds I've seen appear to be using TypePad)
>
> Yes.
>
>> or are you supposed to ignore the mode attribute when the content type is set to text/html and always treat it as escaped?
>
> No.

Thanks for the confirmation. I was beginning to think I was wrong. I tested this in 15 different aggregators and all but one ignored the mode and unescaped the content anyway. I have a horrible feeling I'm going to have to add code to emulate this behaviour.

Regards
James
Re: atom:name ... text or html?
* Eric Scheid [EMAIL PROTECTED] [2006-03-23 18:05]:

> It's true that XML has only a half dozen or so entities defined, meaning most interesting entities from html can't exist in XML ... unless maybe they are wrapped in a CDATA block like above?

No, a CDATA block simply means that characters like <, > and & stand for themselves.

> I'm getting the data by scraping an html page, so I'm expecting it to be acceptable html code, including html entities.

Then decode the entities to a Unicode string and emit the feed as Unicode. Simplest thing that will work reliably.

Regards,
-- Aristotle Pagaltzis // http://plasmasturm.org/
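
The "decode, then emit as Unicode" advice could look roughly like the following in Python. This is a sketch under assumptions: the function name author_xml is made up, the element is built without the Atom namespace to keep it short, and html.unescape from the standard library does the entity decoding.

```python
import html
import xml.etree.ElementTree as ET

def author_xml(scraped_name):
    """Build an author element from a name scraped out of an HTML page."""
    # Step 1: turn HTML entities (named or numeric) into real characters.
    name = html.unescape(scraped_name)
    # Step 2: let the XML library do whatever escaping XML itself requires.
    author = ET.Element("author")
    ET.SubElement(author, "name").text = name
    return ET.tostring(author, encoding="unicode")

print(author_xml("Bertrand Caf&eacute;"))
print(author_xml("Tom &amp; Jerry"))
```

The serializer writes é as a plain character and re-escapes only what XML needs escaped, such as the ampersand in the second name, so neither CDATA nor HTML entities end up in the feed.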
Re: atom:name ... text or html?
* Sylvain Hellegouarch [EMAIL PROTECTED] [2006-03-23 18:15]:

> Do you mean that human-readable is equivalent to solely English? Because as a French, having accents in names is so natural that I see it as human readable too ;)

Even as a French, you probably write é, not &eacute;. :-)

Regards,
-- Aristotle Pagaltzis // http://plasmasturm.org/
Re: atom:name ... text or html?
On Mar 23, 2006, at 9:48 AM, James Holderness wrote:

> Hahaha! It's RSS all over again. In the words of Mark Pilgrim: "Here's something that might be HTML. Or maybe not. I can't tell you, and you can't guess." :-)
>
> Seriously though, the atom:name element is described as a human-readable name, so unless your name really is "Bertrand Caf&eacute;" that can't be right. If RFC4287 had intended to allow markup in the element it would have used atomTextConstruct.

I agree with James here--if we had intended for the name to be able to include markup, we should have used the construct we created to allow that. This from RFC 4287 (section 3.2):

    element atom:name { text }

would have been this:

    element atom:name { atomTextConstruct }

if we had intended for it to be able to contain anything but literal text after XML un-escaping, right?

On Mar 23, 2006, at 9:57 AM, Eric Scheid wrote:

> It's true that XML has only a half dozen or so entities defined, meaning most interesting entities from html can't exist in XML ... unless maybe they are wrapped in a CDATA block like above?

If they're wrapped in a CDATA block, then they don't trigger an XML parsing error, but wrapping something in CDATA isn't a license to enter data in a format other than what the RFC allows.

> I'm getting the data by scraping an html page, so I'm expecting it to be acceptable html code, including html entities.

You, the producer, are getting the data from an HTML page, so you should certainly be prepared to handle HTML entities in it. But you, the Atom publisher, are responsible for making sure that you've made any changes to the data that are necessary for it to be proper Atom before you publish it. The consumer of the Atom feed doesn't know where you got the data, and thus can't be expected to decide how to process it based on where you got it.
Re: text/html with mode=xml in Atom 0.3
* James Holderness [EMAIL PROTECTED] [2006-03-23 18:40]:

> I tested this in 15 different aggregators and all but one ignored the mode and unescaped the content anyway.

Good thing this rule was changed in Atom 1.0, then…

What I really don’t get is what that `xmlns` attribute is doing there in the CDATA block of your data sample. Sometimes I wonder if CDATA should not have been left out of the XML spec; it seems to create far too much confusion to be worthwhile.

Regards,
-- Aristotle Pagaltzis // http://plasmasturm.org/
Re: atom:name ... text or html?
David Powell wrote:

> [Hmm, internal DTD subsets completely fail in IE7's feed reader, throwing up a friendly error message]

If I remember correctly they considered that a feature. Something to do with DTDs being a security risk. I'm not sure if this also meant they were incapable of processing Netscape RSS 0.91 feeds. All I know is that if I ever have a blog, I'll be sure to include a DTD at the top of my feed.

Regards
James
Does xml:base apply to type=html content?
xml:base applies to type="xhtml" content, but I'm not sure whether it is also supposed to apply to escaped type="html" content. I reckon that it does. Has anybody come across this? Any opinions?

-- Dave
Re: text/html with mode=xml in Atom 0.3
A. Pagaltzis wrote:

> What I really don’t get is what that `xmlns` attribute is doing there in the CDATA block of your data sample. Sometimes I wonder if CDATA should not have been left out of the XML spec; it seems to create far too much confusion to be worthwhile.

Well if you look at some of those feeds I listed, many of the entries are type="application/xhtml+xml" with a namespaced div element as you would expect. It looks like they may have taken the exact same code (or template, or however it is they do this stuff) and reused it for type="text/html". Only with the html they decided they should wrap everything in a CDATA block just to be safe.

Regards
James
Re: atom:name ... text or html?
On Mar 23, 2006, at 8:01 AM, Sylvain Hellegouarch wrote:

>> Seriously though, the atom:name element is described as a human-readable name,
>
> Do you mean that human-readable is equivalent to solely English? Because as a French, having accents in names is so natural that I see it as human readable too ;)

You can have accents, you just can't use HTML entities to get them.

-Tim
Re: atom:name ... text or html?
On Mar 23, 2006, at 8:57 AM, Eric Scheid wrote:

> On 24/3/06 3:21 AM, Anne van Kesteren [EMAIL PROTECTED] wrote:
>
>>> <author><name><![CDATA[Bertrand Caf&eacute;]]></name></author>
>>
>> Even if it was HTML you couldn't really use the entity, could you? I think you have to use a character reference or the actual character instead, yes.
>
> It's true that XML has only a half dozen or so entities defined

To be precise, 5: &lt; &amp; &gt; &apos; &quot;

-Tim
Re: atom:name ... text or html?
On Mar 23, 2006, at 8:16 AM, Eric Scheid wrote:

> If I have an author with the name Bertrand Café, is it acceptable to put that into atom:author like this;
>
> <author><name><![CDATA[Bertrand Caf&eacute;]]></name></author>
>
> or should I be using the unicode numeric entity instead?

The key point is that the atom:name element, described in RFC4287 3.2.1, is not a Text Construct, as defined in 3.1, so you can't say <atom:name type="html">; so no markup allowed. So just say Bertrand Café.

-Tim
Re: Atom Thread Feed syntax
Just wanted to follow through on this for everyone. Given that there are vendors getting ready to ship code based on the current rev of the spec, I'm *not* going to rename the id attribute to ref. Yes, I know that id is confusing to some folks, but we're just talking about the name of a single attribute and not a critical functional bug. From this point forward, only critical spec bugs will be fixed and I will be submitting the spec for consideration as a standards track RFC in the not too distant future.

- James

Sylvain Hellegouarch wrote:

> Hi everyone,
>
> I was reading the Atom Feed Thread draft [1] yesterday and I ran into a problem, as I described in my blog [2].
>
> To recap, the 'in-reply-to' element defined in that specification takes an 'id' attribute that specifies /the universally unique identifier of the resource being responded to/.
>
> Calling such an attribute 'id' is a mistake in my opinion, as it can be confused with the actual ID of the element itself within the XML document it belongs to, and it makes it impossible for another element within the document to have the same value as an 'id'.
>
> I would rather move the content of that attribute into the text content of the 'in-reply-to' element (as the atom:id element does).
>
> Thoughts?
>
> - Sylvain
>
> [1] http://www.ietf.org/internet-drafts/draft-snell-atompub-feed-thread-05.txt
> [2] http://www.defuze.org/archives/2006/03/14/about-atom-feed-threads
Re: atom:name ... text or html?
On 24/3/06 4:42 AM, A. Pagaltzis [EMAIL PROTECTED] wrote:

>> I'm getting the data by scraping an html page, so I'm expecting it to be acceptable html code, including html entities.
>
> Then decode the entities to a Unicode string and emit the feed as Unicode. Simplest thing that will work reliably.

I figured as much. Oh well, now to track down a list of html entities and their corresponding unicodes ...

e.
Re: atom:name ... text or html?
On Mar 23, 2006, at 2:20 PM, Eric Scheid wrote:

> Oh well, now to track down a list of html entities and their corresponding unicodes ...

http://www.google.com/search?q=xhtml%20entities
Re: atom:name ... text or html?
* Eric Scheid [EMAIL PROTECTED] [2006-03-23 23:30]:

> Oh well, now to track down a list of html entities and their corresponding unicodes ...

That would be in the spec.

http://www.w3.org/TR/REC-html40/sgml/entities.html

But you shouldn’t have to. Any self-respecting language has a library for that somewhere.

Regards,
-- Aristotle Pagaltzis // http://plasmasturm.org/
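
As one example of such a library, Python ships the HTML 4 entity table as html.entities.name2codepoint. The sketch below uses it to rewrite named entities as numeric character references, which any XML parser understands outside CDATA. The helper name to_numeric_refs and the regular expression are assumptions for illustration, and unknown names are deliberately left untouched.

```python
import re
from html.entities import name2codepoint

# The five entities XML predefines; these can stay as they are.
XML_PREDEFINED = {"lt", "gt", "amp", "apos", "quot"}

def to_numeric_refs(text):
    """Replace HTML named entities with numeric character references."""
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave XML's own entities and unknown names alone
        return "&#%d;" % name2codepoint[name]
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", replace, text)

print(name2codepoint["eacute"])
print(to_numeric_refs("Bertrand Caf&eacute;"))
```

The second line printed is the form Eric was asking about, suitable for placing directly in atom:name without a CDATA wrapper.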
Re: Atom Thread Feed syntax
Thursday, March 23, 2006, 9:39:09 PM, James M Snell wrote:

> Just wanted to follow through on this for everyone. Given that there are vendors getting ready to ship code based on the current rev of the spec, I'm *not* going to rename the id attribute to ref. Yes, I know that id is confusing to some folks, but we're just talking the name of a single attribute and not a critical functional bug. From this point forward, only critical spec bugs will be fixed and I will be submitting the spec for consideration as a standards track RFC in the not too distant future.

I'm more bothered about the use of undefined markup on the link element. I know, I know, I keep going on and on about this, but I keep seeing more drafts that do the same thing and it isn't just a theoretical problem: Windows Feed Platform does not preserve arbitrary markup other than proper extension elements. Other feed stores and servers are likely to do the same (justifiably IMO).

The abandonment of extension constructs in favour of undefined markup by this draft, and other draft-*-atompub-* drafts would be an interoperability concern if these drafts were deployed. If you want to extend Atom, use Extension Elements.

-- Dave
Re: Atom Thread Feed syntax
* David Powell [EMAIL PROTECTED] [2006-03-24 02:20]:

> The abandonment of extension constructs in favour of undefined markup by this draft, and other draft-*-atompub-* drafts would be an interoperability concern if these drafts were deployed. If you want to extend Atom, use Extension Elements.

I don’t follow. Please explain how these drafts fail to satisfy the criteria in Section 6.4.2, Structured Extension Elements.

Regards,
-- Aristotle Pagaltzis // http://plasmasturm.org/
Re: Atom Thread Feed syntax
I believe the concern is over the thr:count and thr:when attributes for the replies link relation, both of which are optional, and both of which provide what I consider to be extra information. In other words, it's ok if an implementation drops them. The important bit is the in-reply-to element and the replies link rel, both of which fall within the bounds of the Atom extension model.

- James

A. Pagaltzis wrote:

> * David Powell [EMAIL PROTECTED] [2006-03-24 02:20]:
>
>> The abandonment of extension constructs in favour of undefined markup by this draft, and other draft-*-atompub-* drafts would be an interoperability concern if these drafts were deployed. If you want to extend Atom, use Extension Elements.
>
> I don’t follow. Please explain how these drafts fail to satisfy the criteria in Section 6.4.2, Structured Extension Elements.
>
> Regards,
Re: Atom Thread Feed syntax
David Powell wrote:

> [snip]
> The abandonment of extension constructs in favour of undefined markup by this draft, and other draft-*-atompub-* drafts would be an interoperability concern if these drafts were deployed. If you want to extend Atom, use Extension Elements.

I'm most certainly not abandoning the extension constructs. One of the motivations for walking these extension specs through the I-D and eventually standards-track process is so that they get their own RFC number. Implementations that choose to support the extension can point to RFC4287 *and* RFCwhatever and say, "I support both". If an implementation only says "I support RFC4287" and doesn't say anything about RFCwhatever, it's pretty clear what the result would be. The most an RFC4287 implementation should be expected to do is adhere to the defined extension model. If that implementation also chooses to support other RFCs that go beyond that extension model, so be it.

That said, the critical parts of the Feed Thread draft (the in-reply-to element and the replies link rel) follow the guidelines of the Atom extension model. That is, any RFC4287 implementation *should* be able to do something with those elements (even if it's just preserving them). The optional parts of the extension (thr:count and thr:when) fall outside of the Atom extension model. That's ok. Implementations can choose to ignore those things, even completely drop them.

As for the other extension drafts I put out, keep in mind that most should be considered strictly experimental at this time. That said, there is really only one that falls outside the extension model: the Link Extensions draft [1], which by definition cannot adhere to the extension model given the fact that Atom link elements are actually not extensible.

[1] http://www.ietf.org/internet-drafts/draft-snell-atompub-link-extensions-02.txt

- James