Julian Reschke wrote:
I agree with what you say, but I fear this topic has been discussed and discussed again in the past with no result. For some reason, people seem to prefer a format where "bad" data (tag soup) is allowed, and the task of fixing it is put onto the recipients rather than the creators. For me it's very obvious that this is the wrong approach -- if you can require the recipients to do that, you can also require the same responsibility from the producers.
I'm afraid it is a "business" problem more than a technical one...
Say you require publishing software to transform the content, either by converting HTML to XHTML or by stripping tags. Stripping is not always acceptable: you should replace an IMG element with its ALT attribute's value, and tables and lists may produce incomprehensible text when cells or items are simply concatenated. Vendors most likely won't do it. Either they'll produce non-well-formed Atom feeds, or they "just" won't implement Atom and will keep their RSS feeds.
At the same time, readers/aggregators will most likely read Atom feeds (as it will be easier, particularly if they don't use an embedded web browser component) in _addition_ to RSS feeds (for which they already deal with multiple versions and extension modules).
There most likely won't be any Atom-only readers/aggregators, so if publishing software already produces RSS (in any widely supported flavor), it has little reason to produce Atom as well.
That's why Atom feeds must be easy to produce, particularly from legacy HTML4 content.
Anyway, I recently made a much more modest request that the spec should state that XHTML is preferred over HTML (because many recipients will not be willing to process tag soup, so in the best case formatting is lost).
Data may also be lost, for example if IMG elements are not replaced by their ALT attribute's value.
Content may become hard to read, for example if it contains a table or a list. For lists, this could be mitigated by putting a comma at the end of each item: <ul><li>first item,</li><li>second item</li></ul>.
And extra non-content text may appear, for example if the ALT attribute of an IMG is not used as expected or if the content contains SCRIPT or STYLE elements. Moreover, HTML4 defines a NOSCRIPT element (there is no NOSTYLE element, because STYLE must appear in the HEAD in a valid HTML4 document).
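To make the three problems above concrete, here is a minimal sketch of what a recipient that strips tags would have to do. It assumes Python's standard html.parser module; the names TextExtractor and html_to_text are hypothetical, and separating list items with ", " is just one possible choice. Tables and NOSCRIPT are not handled at all.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: reduce HTML4 tag soup to plain text."""

    SKIPPED = {"script", "style"}  # elements whose text is not content

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []        # collected text fragments
        self.skip_depth = 0    # > 0 while inside SCRIPT or STYLE
        self.list_items = []   # per open list: number of items seen so far

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1
        elif tag == "img":
            # Keep the ALT text so the image's meaning is not lost.
            alt = dict(attrs).get("alt")
            if alt:
                self.parts.append(alt)
        elif tag in ("ul", "ol"):
            self.list_items.append(0)
        elif tag == "li" and self.list_items:
            # Separate items at the start tag, since </li> is optional in HTML4.
            if self.list_items[-1] > 0:
                self.parts.append(", ")
            self.list_items[-1] += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth:
            self.skip_depth -= 1
        elif tag in ("ul", "ol") and self.list_items:
            self.list_items.pop()

    def handle_data(self, data):
        # Drop script and style text, keep everything else.
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(markup):
    """Return the readable text of an HTML fragment (assumed helper)."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts).strip()
```

Even this small amount of logic is something every recipient would have to reimplement, each slightly differently, which is the cost I am worried about.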
> It seems that we can't even get a consensus for that, which is really disappointing considering the expectations I had 18 months ago :-(
However, Atom is not only a feed syntax, it's also a publishing/editing protocol. This means publishing software implementing the Atom protocol will have to deal with this escaped-HTML too.
And even if most publishing software treats HTML "just" as some "opaque text" that's eaten and regurgitated "as is", some weblog tools aim to produce XHTML (b2evolution or WordPress, for example), so they will need to deal with escaped-HTML and transform it to XHTML.
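As a rough illustration of the first step such software faces, the sketch below unescapes a piece of escaped-HTML and checks whether the result is already well-formed XML. It only uses Python's standard html and xml.etree.ElementTree modules; the function name is hypothetical, and wrapping the fragment in a DIV is my own assumption to give the parser a single root. Content that fails the check would still need a real tag-soup-to-XHTML conversion, which this does not attempt.

```python
import html
import xml.etree.ElementTree as ET


def is_well_formed_fragment(escaped):
    """Hypothetical check: does this escaped-HTML unescape to well-formed XML?"""
    markup = html.unescape(escaped)
    try:
        # Wrap in a single root element so a fragment can be parsed.
        ET.fromstring("<div>" + markup + "</div>")
    except ET.ParseError:
        return False
    return True
```

For example, escaped content containing an unclosed BR fails the check, while the same content with an XML-style empty element passes.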
That's why a consensus is hard to find...
I'll try to find an even better solution than the one I proposed in my answer to Graham and post it here tomorrow (it's nearly midnight in France). Ideally one that does not seem to bring back the multipart/alternative type...
In short, I'm deeply convinced that an all-XML syntax (i.e. including XHTML) is THE right way, but I must admit that we need to provide a way to integrate escaped-HTML if we want Atom to be widely adopted (even RSS 1.0, and now RSS 1.1, have been extended to allow escaped-HTML...)
-- Thomas Broyer
