RE: BIDI (was Proposal: Atomext WG)

Brian Smith Mon, 07 Jan 2008 08:40:24 -0800

James M Snell wrote:
> Brian Smith wrote:
> > The Unicode/W3C guidelines say:
> > * Use *XHTML* BIDI markup whenever possible.
> > * Otherwise *CSS* whenever possible.
> > * Otherwise, consider building BIDI markup into your markup schema.
> > * We have to support BIDI formatting codes anyway, since 
> >   the above mechanisms don't solve all BIDI problems.
> 
> Sorry if I wasn't clear, I was referring specifically to your 
> assertion that, "it seems much easier to implement the an 
> existing BIDI markup mechanism"... Having already implemented 
> it, I can see no additional complexity or difficulty.


I wrote an Atom-to-HTML converter in XSLT in which, as long as the BIDI 
formatting characters do not span (text/attribute) nodes in the document, the 
BIDI directionality is maintained. This is trivial to do by simply preserving 
the values of the nodes. In fact, many applications should *accidentally* get it 
right as long as they are not rewriting the actual content of the 
nodes; even then, they will still probably get it right. This is the primary 
basis of my assertion: significantly more BIDI-oblivious applications will 
preserve Unicode formatting characters than BIDI markup.
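
To illustrate what I mean by "accidentally" getting it right, here is a minimal 
sketch in Python (not my XSLT converter). The function name and the sample 
entry are made up for the example. The code contains no BIDI logic at all; it 
just copies the value of the title node into an HTML element:

```python
import xml.etree.ElementTree as ET

RLE, PDF = "\u202b", "\u202c"  # RIGHT-TO-LEFT EMBEDDING, POP DIRECTIONAL FORMATTING
ATOM = "{http://www.w3.org/2005/Atom}"


def entry_title_to_html(entry_xml: str) -> str:
    """Copy the atom:title text verbatim into an HTML <h1> (hypothetical helper)."""
    entry = ET.fromstring(entry_xml)
    h1 = ET.Element("h1")
    h1.text = entry.findtext(ATOM + "title")  # node value copied as-is
    return ET.tostring(h1, encoding="unicode")


# Sample title: a Hebrew word plus "Atom", wrapped in an RLE ... PDF pair.
title = RLE + "\u05e9\u05dc\u05d5\u05dd Atom" + PDF
entry_xml = (
    '<entry xmlns="http://www.w3.org/2005/Atom">'
    "<title>" + title + "</title></entry>"
)
html = entry_title_to_html(entry_xml)
```

The formatting characters survive because they are ordinary characters in the 
node value; the converter never has to know they exist.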

> > The BIDI draft says it only applies to constructs that 
> > RFC4287 labeled "language sensitive." Accordingly, the BIDI 
> > draft does not apply to extension elements.
> 
> Section 6.4.2: "Structured Extension elements are Language-Sensitive."
>
> atom:category and atom:link each have exactly one language-sensitive 
> attribute.

Which attributes on elements and subelements of structured extension elements 
are language-sensitive? It seems it must be either all or none.

> there's no reason to try eliminating all need for bidi formatting 
> characters in language sensitive attribute values.

I agree. If there are going to be formatting codes in the document anyway, then 
why do we need a mechanism that duplicates their functionality?

> > * Otherwise, use Unicode BIDI/Ruby formatting codes, such 
> >   that matching pairs of formatting codes are fully contained 
> >   within a single text or attribute node.
> 
> Whose responsibility is it to apply the formatting codes? The person 
> typing the text or the software?  How does the software know when to 
> apply the codes?

> Also, what about when an Atompub client edits an entry?
> Is the Atompub client responsible for preserving the unicode 
> formatting characters? What if they don't?

My hypothesis is that an implementation ignorant of BIDI issues is more likely 
to preserve the formatting characters than Atom/XHTML/HTML BIDI markup, 
especially when the effects of those formatting characters never span multiple 
nodes in the document.
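
The "never span multiple nodes" condition is easy to check mechanically. A 
rough sketch, assuming only the explicit embedding and override characters 
(LRE, RLE, LRO, RLO) paired with PDF; the function name is hypothetical:

```python
# Explicit BIDI formatting characters that open a scope closed by PDF.
OPENERS = {"\u202a", "\u202b", "\u202d", "\u202e"}  # LRE, RLE, LRO, RLO
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING


def is_self_contained(node_value: str) -> bool:
    """True if every opener in this one node value is closed within it."""
    depth = 0
    for ch in node_value:
        if ch in OPENERS:
            depth += 1
        elif ch == PDF:
            depth -= 1
            if depth < 0:  # a PDF whose opener would be in another node
                return False
    return depth == 0  # no opener left dangling past the end of the node
```

A producer could run this over each text node and attribute value before 
serializing; a consumer that only copies node values never needs it.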

> There are existing Atompub clients out there that, more than likely,
> will not, and since the formatting codes are non-visual, it's not
> likely the user will notice them either, causing unexpected
> rendering issues later on.

I don't understand why a user that needs BIDI functionality would use a client 
that doesn't have BIDI support. Further, I don't understand how clients that 
are incapable of generating BIDI formatting codes can generate markup compliant 
with your proposal. Are you expecting the users to edit the markup directly? 

Markup is just as non-visual as formatting codes, except for people using "view 
source" and the like.

> With the bidi attribute approach, per rfc5023, non-supporting
> clients are expected to at least preserve the bidi attribute
> but will otherwise continue working as they currently do,
> without risk of corrupting the text by inadvertently dropping
> or improperly nesting the bidi controls.

RFC 4287 and RFC 5023 are pretty unclear about what is required to be preserved. 
Firstly, RFC 5023 says that implementations can do whatever they want as long 
as the results are well-formed. Otherwise, AtomPub implementations that use 
(X)HTML whitelists would be non-compliant. Also, the requirement to preserve 
unknown foreign markup seems to apply more to unknown extension elements than 
to unknown attributes on known elements. In particular, if I replace the 
atom:author element with a new one, then I am not going to preserve the old, 
unknown attributes on the previous atom:author element.

> Also, imagine a case where we have a feed with 100 entries, each with 
> about 5 atom:category elements.  Let's say that the feed is 
> generally all RTL.  Using your approach, that's at least 1000 extra 
> characters in the feed, and 500 opportunities for the embedding to
> be screwed up.

The category labels are almost always going to be composed of strong RTL and 
strong LTR characters (only), so the BIDI algorithm will work correctly. And, 
when it doesn't work, it is unlikely that it will be due to the wrong base 
directionality--in these cases, formatting characters are going to be needed no 
matter what. Right? 

> Re: The ruby formatting codes: Even the Unicode spec warns 
> against using the ruby formatting codes for anything other than internal 
> storage.  We gain absolutely nothing by bringing ruby into this discussion.

I disagree. Bringing Ruby text into the discussion emphasizes the need for 
*new* language-sensitive constructs to be text constructs, and emphasizes the 
need to support (X)HTML in all text constructs instead of depending on Unicode 
formatting codes.

> > * Editors of new documents must be meticulous about 
> >   inserting the proper markup and formatting codes.
> > * Processors of existing documents must be meticulous about 
> >   preserving BIDI/Ruby markup and/or formatting codes whenever 
> >   any part of the contained text is preserved.
> 
> Again, what about older editors that know nothing about the proper 
> markup or formatting codes?

All of these old editors will fail to implement the Atom BIDI spec as well. 
There are currently more editors that can insert formatting characters than can 
handle Atom BIDI markup, aren't there? I think that will always be the case.

> > I recognize that this goes against the Unicode in XML 
> > guidelines. However, Atom 
> > already goes against the guidelines by having 
> > language-sensitive text in attribute 
> > values and other contexts where XHTML markup cannot be used.
> 
> What is the benefit of going against the Unicode in XML guidelines?

1. It is extremely easy to implement. As long as an implementation copies text 
nodes and attribute node values verbatim, or uses any transform that doesn't 
strip out formatting codes, the BIDI directionality will always be preserved, 
with no extra work. A processor can extract any entry, the content of any 
entry, any entry metadata, or any feed metadata, without having to rewrite the 
markup to preserve the BIDI information. Literally, it requires more work to 
screw it up than to implement it correctly.

2. It is general purpose. It works for RSS, Atom, and any other XML format.

3. It reuses existing markup without introducing any of its own.

Here is an expanded version of my proposal that provides BIDI and Ruby support 
for all language-sensitive constructs in Atom:

In all cases, only use markup and/or formatting codes when necessary. It is 
recommended to avoid Ruby Unicode formatting characters whenever possible.

Text Constructs: Use (X)HTML, and use the (X)HTML BIDI and Ruby markup.

atom:content: Use (X)HTML whenever possible. For other XML content, use the 
existing BIDI and Ruby markup defined for that XML application. Otherwise, use 
Unicode formatting codes, and ensure that matching pairs of formatting codes do 
not span multiple text nodes. For cases where type="text" must be used, use 
Unicode formatting characters.

Language-sensitive attributes: Use Unicode formatting characters.

Extension elements: New language-sensitive extension elements should ensure 
that all language-sensitive text is enclosed in text constructs. In particular, 
do not use attributes for language-sensitive text, support at least 
type="xhtml", and support type="html" whenever possible. Be sure to support 
BIDI and Ruby markup.

atom:name: When consuming atom:name, treat it as though it were defined as a 
text construct, and allow for type="xhtml" and type="html". When producing 
atom:name elements, BIDI markup SHOULD be replaced with Unicode formatting 
characters when needed. Ruby markup may be replaced with Unicode formatting 
characters or stripped entirely. All other markup should be stripped entirely, 
and the type attribute should be removed.
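
For the atom:name case, replacing BIDI markup with formatting characters could 
look roughly like the following. This is only a sketch under my own 
assumptions: it handles the dir attribute (as an embedding, or as an override 
on bdo), strips every other tag while keeping its text, and does nothing 
special for Ruby markup. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

PDF = "\u202c"                                  # POP DIRECTIONAL FORMATTING
EMBED = {"ltr": "\u202a", "rtl": "\u202b"}      # LRE, RLE
OVERRIDE = {"ltr": "\u202d", "rtl": "\u202e"}   # LRO, RLO


def flatten(elem: ET.Element) -> str:
    """Return elem's text with dir markup replaced by formatting characters."""
    local = elem.tag.rsplit("}", 1)[-1]         # drop any namespace part
    table = OVERRIDE if local == "bdo" else EMBED
    opener = table.get(elem.get("dir", "").lower(), "")
    parts = [elem.text or ""]
    for child in elem:
        parts.append(flatten(child))            # child's own pair stays balanced
        parts.append(child.tail or "")
    inner = "".join(parts)
    return opener + inner + PDF if opener else inner


def name_to_plain(xhtml_fragment: str) -> str:
    """Flatten an XHTML fragment into a plain atom:name value."""
    return flatten(ET.fromstring(xhtml_fragment))
```

Because each opener is emitted together with its PDF, the resulting pairs are 
always fully contained in the single text node that atom:name ends up with.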

- Brian
