RE: BIDI (was Proposal: Atomext WG)

Brian Smith Mon, 07 Jan 2008 10:59:29 -0800

James M Snell wrote:
> Brian Smith wrote:
> > I agree. If there are going to be formatting codes in the document 
> > anyway, then why do we need a mechanism that duplicates their 
> > functionality?
> 
> Because their use in markup is problematic and is actively discouraged 
> for a number of very good reasons.


Even the latest W3C guidelines are now encouraging the use of some formatting 
characters (LRM and RLM) in addition to markup, where the previous guidelines 
recommended markup only:

http://www.w3.org/TR/i18n-html-tech-bidi/#ri20030218.135304584

So, Atom implementations have to be prepared to accept at least LRM and RLM in 
documents, anyway.

> > My hypothesis is that an implementation ignorant of BIDI issues is 
> > more likely to preserve the formatting characters than Atom/XHTML/HTML
> > BIDI markup, especially when the effects of those formatting 
> > characters never span multiple nodes in the document.
> 
> Have you tested this hypothesis using real editors?  Example, in our 
> internal blogging environment, tags are entered in a single text box, 
> each tag separated by a comma.  The system splits the tags into an 
> array and saves each tag separately.
> Each tag becomes a separate atom:category element.  Is the user 
> responsible for adding the appropriate formatting codes around each 
> individual tag?  When the user wishes to edit the entry later, perhaps 
> to add a new entry, are they supposed to just know that there are 
> non-visual bidi formatting codes interspersed into the comma separated 
> list of tags?

When software breaks apart BIDI text and recombines it, it has to preserve the 
BIDI formatting. In this case, the system that splits apart the tags into an 
array and/or the system that recombines the tags into a comma-separated list 
should transparently handle the formatting codes.
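
As a rough sketch of what "transparently handle" could mean for the 
comma-separated tag box, the Python below splits a field into tags and makes 
each tag self-contained: an embedding or override that was opened in an earlier 
tag is re-opened in the next one, and every tag closes what it opens. The 
function names split_tags and join_tags are hypothetical, and treating the 
comma as the only separator is an assumption taken from the description above.

```python
# Unicode explicit directional formatting characters.
LRE, RLE, PDF, LRO, RLO = "\u202a", "\u202b", "\u202c", "\u202d", "\u202e"
OPENERS = {LRE, RLE, LRO, RLO}


def split_tags(field):
    """Split a comma-separated tag field into self-contained tags.

    Hypothetical helper: each returned tag carries balanced
    embedding/override codes, so no code's effect spans two tags.
    """
    tags = []
    open_codes = []  # codes still open when the previous comma was reached
    for raw in field.split(","):
        piece = raw.strip()  # str.strip() does not remove formatting codes
        stack = list(open_codes)
        body = []
        for ch in piece:
            if ch in OPENERS:
                stack.append(ch)
            elif ch == PDF:
                if not stack:
                    continue  # drop a PDF that closes nothing
                stack.pop()
            body.append(ch)
        visible = [c for c in piece if c not in OPENERS and c != PDF]
        if visible:
            # Re-open what was inherited, then close whatever is still open.
            tags.append("".join(open_codes) + "".join(body) + PDF * len(stack))
        open_codes = stack
    return tags


def join_tags(tags):
    """Recombine tags for editing; each tag is already balanced."""
    return ", ".join(tags)
```

With this approach the user never types or sees the codes; a tag that was 
inside an RLE...PDF run in the text box keeps that embedding after the split.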

> You're assuming that all users have the same requirements.  
> In our environment, a single feed may include entries from many 
> different users. We have group blogs where users from many different 
> locales have edit rights on any entry in the blog.  Further, our users 
> use many different editors to write and manage their blog entries.  
> Asking those users to be mindful of how they're using bidi formatting 
> characters is a lot more difficult than what we currently do, which is 
> provide a simple check box to indicate whether or not the entry is 
> "right-to-left", which in turn, is translated into the appropriate 
> dir="rtl" in the markup.
>
> Clients that do not understand the dir attribute simply ignore it, and 
> since our software is written so that only explicit changes in value 
> are recognized (e.g. a missing dir attribute does not mean the dir 
> attribute value has changed) we're able to work seamlessly with 
> editors that do not support the attribute.

In my suggested mechanism, this concern is only relevant for language-sensitive 
attributes and atom:name, since those are the only places where XHTML's BIDI 
markup cannot be used. Even then, it only applies when all of the following 
hold: a user is editing part of an entry written in an RTL language; the 
Unicode BIDI algorithm fails to work for that part of the entry; the user's 
software doesn't support BIDI text entry (meaning they probably can't read or 
write any RTL languages); and their change somehow still requires preserving 
the RTL base directionality but doesn't require any other formatting codes. I 
agree that "extremely unlikely" isn't the same as "impossible," but in this 
case it seems pretty close.

> > Markup is just as non-visual as formatting codes, except for people 
> > using "view source" and the like.
> 
> Yes, but with the markup we don't have to rely on users getting it 
> right when they type in the values.

I agree that we can't rely on users to manually enter formatting codes. But, 
the software can handle the formatting codes as transparently as it handles the 
"dir" attribute.

> > RFC 4287 and RFC 5023 [are] pretty unclear about what is required to 
> > be preserved.
> 
> RFC 5023, Section 9.3: To avoid unintentional loss of data when 
> editing Member Entries or Media Link Entries, an Atom Protocol client 
> SHOULD preserve all metadata that has not been intentionally modified, 
> including unknown foreign markup as defined in Section 6 of [RFC4287].
>
> Seems pretty darn clear to me.

I will illustrate what I am saying with an example. An atom:link element is not 
unknown foreign markup. So, I can remove atom:link elements whenever I want. In 
particular, I can remove an atom:link element and then replace it with another 
atom:link element that links to the same destination. That doesn't violate the 
specification, but the old element might have a "dir" attribute and the new one 
might not.
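
The atom:link scenario can be sketched in a few lines of Python using the 
standard xml.etree.ElementTree module. The entry, the example.org URL, and the 
replace_link helper are all made up for illustration, and placing "dir" as an 
un-namespaced attribute on atom:link is an assumption about how the bidi draft 
would be used.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
ET.register_namespace("", ATOM)

# Hypothetical entry whose link carries a "dir" attribute.
ENTRY = (
    '<entry xmlns="http://www.w3.org/2005/Atom">'
    '<link rel="alternate" href="http://example.org/post" dir="rtl"/>'
    '</entry>'
)


def replace_link(entry, href):
    """Remove every atom:link pointing at href and append a fresh one.

    The client only knows rel and href, so nothing else is copied over.
    """
    tag = "{%s}link" % ATOM
    for old in entry.findall(tag):
        if old.get("href") == href:
            entry.remove(old)
    return ET.SubElement(entry, tag, {"rel": "alternate", "href": href})


entry = ET.fromstring(ENTRY)
new_link = replace_link(entry, "http://example.org/post")
print(new_link.get("dir"))  # None: the attribute did not survive
```

The client here never touched unknown foreign markup, yet the directionality 
information is gone after an otherwise harmless edit.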

> 9.3 uses the term "unknown foreign markup", which RFC 4287 defines as 
> unknown elements AND attributes.

I don't see that anywhere in RFC 4287. Even if it were there, it would be very 
problematic to support that in all situations, and I would recommend against 
anybody depending on implementations to preserve unknown attributes on known 
elements, especially when the content of those known elements is being modified.

> > The category labels are almost always going to be composed of strong 
> > RTL and strong LTR characters (only), so the BIDI algorithm will 
> > work correctly. And, when it doesn't work, it is unlikely that it 
> > will be due to the wrong base directionality--in these cases, 
> > formatting characters are going to be needed no matter what. Right?
> 
> "almost always" is not "always".  The Atom bidi draft covers the cases 
> where "always" is more desirable than "almost always".

Please give an example of a category label that requires RTL base 
directionality, fails to be rendered correctly using the Unicode BIDI 
algorithm, doesn't require any directionality modifiers other than the base 
directionality, which your software supports, and which you expect every Atom 
implementation to support. AFAICT, this would be a label that starts or ends 
with punctuation, or is a mixture of LTR and RTL text, with punctuation at the 
point where the directionality changes.
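
To make that class of labels concrete, here is a rough heuristic in Python 
built on the standard unicodedata module. It is not an implementation of the 
Unicode BIDI algorithm; it only flags labels that start or end with a 
character that is not strongly directional, or that have such characters 
sitting between strong LTR and strong RTL text. The function names are 
hypothetical.

```python
import unicodedata


def strong_direction(ch):
    """Return "ltr" or "rtl" for strong characters, None otherwise."""
    cls = unicodedata.bidirectional(ch)
    if cls == "L":
        return "ltr"
    if cls in ("R", "AL"):
        return "rtl"
    return None  # digits, punctuation, whitespace, and so on


def may_need_direction_hint(label):
    """Heuristic: could the base direction change how this label renders?"""
    dirs = [strong_direction(ch) for ch in label]
    if not dirs:
        return False
    # Case 1: the label starts or ends with a non-strong character.
    if dirs[0] is None or dirs[-1] is None:
        return True
    # Case 2: non-strong characters sit exactly where direction changes.
    prev = None
    gap = False
    for d in dirs:
        if d is None:
            gap = True
            continue
        if prev is not None and d != prev and gap:
            return True
        prev = d
        gap = False
    return False
```

Note that LRM and RLM count as strong characters here, so a label that ends in 
"!" followed by an RLM is no longer flagged. That is the case where a single 
formatting character does the job without any base-direction attribute.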

> And no, the formatting characters are not always going to be needed.  
> If I am rendering the text in (x)html, I don't want the formatting 
> characters in there at all; rather, I want to follow best practices 
> and use the (x)html provided bidi markup mechanisms.

I agree 100%. A BIDI-enabled Atom-to-HTML converter should always convert 
Unicode formatting characters to markup, whenever possible.
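
A minimal sketch of such a conversion step, in Python, might look like the 
following. It assumes the input is plain text (not already markup), maps 
embeddings to span elements with a dir attribute and overrides to bdo 
elements, and leaves LRM and RLM in place, since those are the characters the 
W3C document above allows alongside markup. The function name is hypothetical.

```python
from html import escape

# Unicode explicit directional formatting characters.
LRE, RLE, PDF, LRO, RLO = "\u202a", "\u202b", "\u202c", "\u202d", "\u202e"

# Opening code -> (opening tag, closing tag).
OPEN_TAGS = {
    LRE: ('<span dir="ltr">', "</span>"),
    RLE: ('<span dir="rtl">', "</span>"),
    LRO: ('<bdo dir="ltr">', "</bdo>"),
    RLO: ('<bdo dir="rtl">', "</bdo>"),
}


def bidi_codes_to_html(text):
    """Convert embedding/override codes in plain text to HTML markup."""
    out = []
    closers = []  # closing tags for the elements currently open
    for ch in text:
        if ch in OPEN_TAGS:
            open_tag, close_tag = OPEN_TAGS[ch]
            out.append(open_tag)
            closers.append(close_tag)
        elif ch == PDF:
            if closers:  # ignore a PDF that closes nothing
                out.append(closers.pop())
        else:
            out.append(escape(ch))  # LRM and RLM pass through unchanged
    while closers:  # close anything the text left open
        out.append(closers.pop())
    return "".join(out)
```

Closing unterminated embeddings at the end of the string keeps the effect of a 
formatting code from leaking into whatever the converter emits next.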

> Failing to implement the atom bidi spec has significantly fewer 
> consequences than improperly implementing the unicode formatting 
> characters.  That is, existing applications will be no worse off than 
> they currently are if the dir attribute is ignored or dropped; 
> however, existing applications can be severely impacted by the 
> improper use of the unicode bidi characters. Again, there are very 
> good reasons behind the recommendation against using the formatting 
> codes in markup.

Right, that is why my suggestion uses XHTML markup whenever possible, and falls 
back to formatting codes only as a last resort.

- Brian
