RE: [i18n+P&C] IRI/URI normalization

Phillips, Addison Fri, 14 Aug 2009 09:03:00 -0700

Hello Marcin,

Thank you for the note. This is a PERSONAL response.


I immediately spotted some red flags in your email. You state:

> 1. The widget configuration document may contain only US-ASCII
> characters, and thus conform to P&C.

This appears to me to be false. The widget configuration document is defined as 
an XML document. The examples all declare the encoding as UTF-8. See, for 
example, [1]. A UTF-8 encoded configuration document can represent any Unicode 
character sequence. For that matter, an XML document can also use numeric 
character references (NCRs) to represent characters not in the document's 
character encoding. And the configuration document could use any valid 
character encoding recognized by the XML processor.
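
As a minimal sketch of this point: the document below is serialized in pure 
US-ASCII, yet the parsed attribute value contains non-ASCII characters. 
Python's xml.etree.ElementTree stands in for the XML processor in the WUA, and 
the feature IRI is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration document. Every byte is US-ASCII; the two
# non-ASCII characters in the feature name are written as NCRs.
doc = (
    b'<?xml version="1.0" encoding="US-ASCII"?>\n'
    b'<widget xmlns="http://www.w3.org/ns/widgets">'
    b'<feature name="http://example.org/api/&#x4F4D;&#x7F6E;"/>'
    b'</widget>'
)

root = ET.fromstring(doc)
feature = root.find("{http://www.w3.org/ns/widgets}feature")

# The XML processor has already expanded the NCRs to U+4F4D U+7F6E.
name = feature.get("name")
print(name)
```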

However, you go on to say:

> 4. To use the non-US-ASCII feature-name, I would percent-encode it,
> as e.g. in [2]. (This seems to be the core of the problem, namely
> usage of feature-name specified in one language within the
> configuration document and text editor using another
> language/encoding).

Percent-encoding is described in the IRI spec as part of mapping IRIs to URIs. 
If you encode it, you should decode it later. But it is not necessary to 
percent-encode it, even if you use US-ASCII as the character encoding of your 
configuration document. You can use NCRs, for example (these are decoded by the 
XML processor in the WUA).
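
To illustrate the difference, here is a rough sketch that puts the same 
made-up feature IRI into an US-ASCII document twice, once percent-encoded and 
once with NCRs. It assumes Python's standard library as a stand-in for the 
WUA's XML processor and omits the widget namespace for brevity.

```python
from urllib.parse import quote, unquote
import xml.etree.ElementTree as ET

# Hypothetical feature IRI with two non-ASCII path characters.
iri = "http://example.org/api/\u4f4d\u7f6e"

# Option 1: percent-encode the UTF-8 bytes of the non-ASCII characters.
percent_form = quote(iri, safe=":/")

# Option 2: write each non-ASCII character as a hexadecimal NCR.
ncr_form = "".join(c if ord(c) < 128 else f"&#x{ord(c):X};" for c in iri)

doc = (
    '<?xml version="1.0" encoding="US-ASCII"?>'
    f'<widget><feature name="{percent_form}"/>'
    f'<feature name="{ncr_form}"/></widget>'
).encode("ascii")

names = [f.get("name") for f in ET.fromstring(doc).iter("feature")]

# The parser leaves percent-escapes alone, so that value must still be
# decoded by the consumer before it matches the original IRI.
print(names[0] == iri, unquote(names[0]) == iri)
# The NCR form comes out of the parser already equal to the original IRI.
print(names[1] == iri)
```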

> Proposed solutions (OR-ed):
> 
> a. Define a rule similar to "10.1.4 Rule for Getting a Single
> Attribute Value" (or a statement in that rule) that would specify
> the IRI/URI normalization according to RFC3987 (section 5.3.2.3).

I would support changing this, although I note that the widget document goes 
out of its way to prohibit URL-encoding (percent encoding) of IRIs. As noted 
above, there are other ways to put non-ASCII path characters into your 
configuration document.

> 
> c. Mandate only UTF-8 encoded configuration documents and disallow
> other encodings (like Shift-JIS, ISO-XY etc).

This would be wrong. It doesn't really solve the problem anyway. The character 
encoding of the serialized XML document is not the limit on the characters that 
can be represented in it, although it might be inconvenient for a lot of NCRs 
to appear in a document.

Please note, I am not recommending that anyone actually use other character 
encodings. I always, personally, recommend that people use UTF-8 for XML. But 
if you need to use a legacy encoding, the spec should not necessarily prevent 
you from doing so.
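
A small sketch of the legacy-encoding case, assuming Python's ElementTree as 
the serializer and an invented feature IRI: when the document is written as 
ISO-8859-1, a character the encoding covers is written as a raw byte, and one 
it does not cover is emitted as an NCR. Nothing is lost on the way back in.

```python
import xml.etree.ElementTree as ET

# Hypothetical feature IRI: U+00E9 exists in ISO-8859-1, U+4F4D does not.
name = "http://example.org/caf\u00e9/\u4f4d"

root = ET.Element("widget")
ET.SubElement(root, "feature", {"name": name})

# Serialize in a legacy encoding. ElementTree writes characters that the
# encoding cannot represent as numeric character references.
data = ET.tostring(root, encoding="iso-8859-1")
print(data)

# Parsing the legacy-encoded bytes recovers the full Unicode string.
parsed = ET.fromstring(data).find("feature").get("name")
print(parsed == name)
```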

> 
> d. Mandate only US-ASCII feature-names (probably bad/against
> internationalization).

The I18N WG would certainly object to this.

I hope that helps. The I18N WG will consider this at our next meeting (next 
week).

Kind Regards,

Addison


[1] http://www.w3.org/TR/2009/CR-widgets-20090723/#configuration-document

Addison Phillips
Globalization Architect -- Lab126
Chair -- W3C Internationalization WG

Internationalization is not a feature.
It is an architecture.
