RE: [i18n+P&C] IRI/URI normalization

Marcin Hanclik Fri, 14 Aug 2009 14:05:14 -0700

Hi Addison,

A correction to my message below.
I used the wrong character table (ISO-8859-2 instead of Unicode).


http://example.com/&#163;&#243;dzki&#163;piewnik&#172;d&#188;b&#179;owy

should be:

http://example.com/&#x141;&#xf3;dzki&#x15a;piewnik&#x179;d&#x17a;b&#x142;owy
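
A quick way to double-check the corrected values is to derive the references from the Unicode code points directly. The sketch below is only an illustration; the helper name to_hex_ncrs is made up for this note and is not part of P&C or any specification.

```python
# Illustrative check only: build hexadecimal NCRs from Unicode code points.
name = "ŁódzkiŚpiewnikŹdźbłowy"


def to_hex_ncrs(text: str) -> str:
    """Replace every non-ASCII character with a hexadecimal NCR."""
    return "".join(ch if ord(ch) < 128 else f"&#x{ord(ch):x};" for ch in text)


ncr_form = to_hex_ncrs(name)
print("http://example.com/" + ncr_form)

# For comparison: the single byte each non-ASCII character has in ISO-8859-2.
# These byte values are NOT valid as NCR numbers, which was the original mistake.
for ch in sorted(set(name)):
    if ord(ch) >= 128:
        byte_value = ch.encode("iso-8859-2")[0]
        print("U+%04X" % ord(ch), "ISO-8859-2 byte:", byte_value)
```

An NCR always names a Unicode code point, whatever encoding the document is saved in, so the ISO-8859-2 byte values must not be used.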


Thanks.

Kind regards,
Marcin
________________________________________
From: [email protected] [[email protected]] On Behalf Of Marcin 
Hanclik [[email protected]]
Sent: Friday, August 14, 2009 9:27 PM
To: Phillips, Addison; [email protected]
Cc: [email protected]; [email protected]
Subject: RE: [i18n+P&C] IRI/URI normalization

Hi Addison,

Great thanks for your rapid answer!

The red flags tell me that some clarification is needed; I was probably not 
clear enough.
So here it goes.

>>> 1. The widget configuration document may contain only US-ASCII
>>> characters, and thus conform to P&C.
>>
>>This appears to me to be false.
The scenario I presented is meant to be just a use case.
I imagine a hypothetical situation in which someone wants to use a feature whose 
name includes non-US-ASCII characters, but has only a US-ASCII editor at 
hand.

I quote here some fragments of my initial email in WebApps [1].

>P&C says:
>"An attribute defined as containing a valid IRI. A valid IRI is one that 
>matches the IRI  token of the [RFC3987] specification."
>Again, is it the syntax in config.xml (i.e. impossible on byte level) or later?
>
>DOM3Core http://www.w3.org/TR/DOM-Level-3-Core/core.html says
>"A solution for loading a Document and saving it persistently is proposed in 
>[DOM Level 3 Load and Save]."
>
>DOM3LS http://www.w3.org/TR/DOM-Level-3-LS/load-save.html will normalize the 
>entities AFAIK, but probably will not normalize percent-encoded characters in 
>URI/IRI.
>
>http://www.w3.org/TR/REC-xml-names/#iri-use says:
>"Because of the risk of confusion between URIs that would be equivalent if 
>dereferenced, the use of %-escaped characters in namespace names is strongly 
>discouraged."

You say:
>>The widget configuration document is defined as an XML document. The examples 
>>all declare the encoding as UTF-8.
>>See, for example [1]. A UTF-8 encoded configuration document can represent 
>>any Unicode
>>character sequence. For that matter, an XML document can also use numeric 
>>character entities
>>to represent characters not in the document's character encoding and the 
>>configuration document could use any
>>valid character encoding recognized by the XML processor.
OK.
RFC3987 provides a normalization method between IRI and URI based on 
percent-encoding,
whereas XML offers another encoding based on numeric character entities.
XML1.1 says [2]:
[10]    AttValue           ::=          '"' ([^<&"] | Reference)* '"'
                        |  "'" ([^<&'] | Reference)* "'"
[67]    Reference          ::=           EntityRef | CharRef
and so on.

So, my understanding is:
the IRI

http://example.com/ŁódzkiŚpiewnikŹdźbłowy

could be encoded as

http://example.com/%C5%81%C3%B3dzki%C5%9Apiewnik%C5%B9d%C5%BAb%C5%82owy

based on RFC3987, or as

http://example.com/&#163;&#243;dzki&#163;piewnik&#172;d&#188;b&#179;owy

based on XML character entities encoding method.
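
To make the two mechanisms concrete, here is a minimal sketch using only the Python standard library. It is an illustration, not something taken from P&C: quote() stands in for the UTF-8 plus percent-encoding step of RFC3987, and html.unescape() stands in for the reference expansion an XML processor would perform. Note that the numeric references it prints use Unicode code points, as in the correction at the top of this thread.

```python
import html
from urllib.parse import quote, unquote

iri_path = "ŁódzkiŚpiewnikŹdźbłowy"

# RFC3987-style mapping: encode as UTF-8, then percent-encode each byte.
percent_form = quote(iri_path, safe="")
print("http://example.com/" + percent_form)

# XML-style escaping: one decimal numeric reference per non-ASCII code point.
ncr_form = "".join(c if ord(c) < 128 else f"&#{ord(c)};" for c in iri_path)
print("http://example.com/" + ncr_form)

# Both forms carry the same characters, but are decoded by different layers:
# the percent form by URI/IRI handling, the reference form by the XML parser.
print(unquote(percent_form) == iri_path)
print(html.unescape(ncr_form) == iri_path)
```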

In general, IMHO, P&C should simply specify how the IRIs 
could/should be encoded.
The method based on character entities looks odd to me. That form cannot be 
used as a URL, e.g. in the browser (and we could imagine a situation in which a 
feature name - an IRI - points to something as is, by copy-pasting the string).
For practical reasons, one method would be enough, I think.

XML Namespaces specification uses URI as attribute value in XML, this seems to 
be similar to our situation.
[4] says:
"The IRI references below are also all different for the purposes of 
identifying namespaces:
      http://www.example.org/~wilbur
      http://www.example.org/%7ewilbur
      http://www.example.org/%7Ewilbur"
So, as I see it, the normalization based on RFC3987 is not mandated there and 
is even wrong for the namespace identification use case.
We may have further confusion :(
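
The difference between the two comparison rules can be sketched as follows. This is an illustration under simplifying assumptions: unquote() decodes every percent-escape, which is more aggressive than the normalization in RFC3986/RFC3987 (those only decode escapes of unreserved characters), but it is enough to show the effect for "~".

```python
from urllib.parse import unquote

names = [
    "http://www.example.org/~wilbur",
    "http://www.example.org/%7ewilbur",
    "http://www.example.org/%7Ewilbur",
]

# Namespace-style comparison: character-for-character string identity.
distinct_as_strings = len(set(names))

# Comparison after percent-decoding (simplified normalization).
distinct_after_decoding = len({unquote(n) for n in names})

print(distinct_as_strings, distinct_after_decoding)
```

Under string identity the three names stay distinct, while after decoding they collapse into one, so whichever rule P&C picks for feature names changes which names count as "the same".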

>>I always, personally, recommend that people use UTF-8 for XML.
Mandating UTF-8 as the encoding for the configuration document is one of the 
proposed solutions.
There are free UTF-8-capable editors, so my hypothetical situation could be 
easily overcome.
The only thing we seem to need is the clarification about what is mandated by 
the P&C specification.

>>But if you need to use a legacy encoding, the spec should not necessarily 
>>prevent you from doing so.
I wonder whether, for practical reasons, legacy encodings should be excluded 
after all.

P&C says:
"A user agent must support the following specifications:

    * [XML].
    * [XMLNS].
    * [DOM3CORE].
    * [UTF-8]."

So, by the inclusion of [XML], it seems that encodings other than UTF-8 are 
implicitly mandated, or are they?
I am not sure whether this is the understanding in WebApps.
Also, depending on the interpretation, a P&C-compliant WUA might not be 
able to process a config.xml written in ISO-XY.
E.g. I read the above as a requirement for config.xml to be encoded in either 
US-ASCII or UTF-8.

Additionally, XML1.1 [3] says:
"processors are, of course, not required to support all IANA-registered 
encodings"

>>I hope that helps. The I18N WG will consider this at our next meeting (next 
>>week).
Thanks again.

Kind regards,
Marcin

[1] http://lists.w3.org/Archives/Public/public-webapps/2009JulSep/0365.html
[2] http://www.w3.org/TR/xml11/#NT-AttValue
[3] http://www.w3.org/TR/xml11/#charencoding
[4] http://www.w3.org/TR/REC-xml-names/#NSNameComparison


________________________________________
From: Phillips, Addison [[email protected]]
Sent: Friday, August 14, 2009 6:01 PM
To: Marcin Hanclik; [email protected]
Cc: [email protected]; [email protected]
Subject: RE: [i18n+P&C] IRI/URI normalization

Hello Marcin,

Thank you for the note. This is a PERSONAL response.

I immediately spotted some red flags in your email. You state:

> 1. The widget configuration document may contain only US-ASCII
> characters, and thus conform to P&C.

This appears to me to be false. The widget configuration document is defined as 
an XML document. The examples all declare the encoding as UTF-8. See, for 
example [1]. A UTF-8 encoded configuration document can represent any Unicode 
character sequence. For that matter, an XML document can also use numeric 
character entities to represent characters not in the document's character 
encoding and the configuration document could use any valid character encoding 
recognized by the XML processor.

However, you go on to say:

> 4. To use the non-US-ASCII feature-name, I would percent-encode it,
> as e.g. in [2]. (This seems to be the core of the problem, namely
> usage of feature-name specified in one language within the
> configuration document and text editor using another
> language/encoding).

Percent encoding is described in the IRI spec as part of mapping IRIs to URIs. 
If you encode it, you should decode it later. But it is not necessary to 
percent-encode it, even if you use US-ASCII-7 as the character encoding of your 
configuration document. You can use NCRs, for example (these are decoded by the 
XML processor in the WUA).
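
As a minimal sketch of that point, the following uses Python's standard XML parser on a document declared as US-ASCII. The document is a made-up fragment for illustration (the feature name is hypothetical), and the parser merely stands in for whatever XML processor a WUA uses.

```python
import xml.etree.ElementTree as ET

# A pure-ASCII configuration fragment; the non-ASCII characters of the
# (hypothetical) feature name are written as numeric character references.
config = (
    b'<?xml version="1.0" encoding="US-ASCII"?>'
    b'<widget xmlns="http://www.w3.org/ns/widgets">'
    b'<feature name="http://example.com/&#x141;&#xf3;dzki"/>'
    b'</widget>'
)

root = ET.fromstring(config)
feature = root.find("{http://www.w3.org/ns/widgets}feature")
name = feature.get("name")

# The parser has already expanded the references into real characters.
print(name == "http://example.com/Łódzki")
print(len(name) - len("http://example.com/"))
```

The application never sees the references themselves, only the decoded Unicode string, so no percent-encoding is needed to get non-ASCII characters into an ASCII-only file.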

> Proposed solutions (OR-ed):
>
> a. Define a rule similar to "10.1.4 Rule for Getting a Single
> Attribute Value" (or a statement in that rule) that would specify
> the IRI/URI normalization according to RFC3987 (section 5.3.2.3).

I would support changing this, although I note that the widget document goes 
out of its way to prohibit URL-encoding (percent encoding) of IRIs. As noted 
above, there are other ways to put non-ASCII path characters into your 
configuration document.

>
> c. Mandate only UTF-8 encoded configuration documents and disallow
> other encodings (like Shift-JIS, ISO-XY etc).

This would be wrong. It doesn't really solve the problem anyway. The character 
encoding of the serialized XML document is not the limit on the characters that 
can be represented in it, although it might be inconvenient for a lot of NCRs 
to appear in a document.

Please note, I am not recommending that anyone actually use other character 
encodings. I always, personally, recommend that people use UTF-8 for XML. But 
if you need to use a legacy encoding, the spec should not necessarily prevent 
you from doing so.

>
> d. Mandate only US-ASCII feature-names (probably bad/against
> internationalization).

The I18N WG would certainly object to this.

I hope that helps. The I18N WG will consider this at our next meeting (next 
week).

Kind Regards,

Addison


[1] http://www.w3.org/TR/2009/CR-widgets-20090723/#configuration-document

Addison Phillips
Globalization Architect -- Lab126
Chair -- W3C Internationalization WG

Internationalization is not a feature.
It is an architecture.




________________________________________

Access Systems Germany GmbH
Essener Strasse 5  |  D-46047 Oberhausen
HRB 13548 Amtsgericht Duisburg
Geschaeftsfuehrer: Michel Piquemal, Tomonori Watanabe, Yusuke Kanda

www.access-company.com

CONFIDENTIALITY NOTICE
This e-mail and any attachments hereto may contain information that is 
privileged or confidential, and is intended for use only by the
individual or entity to which it is addressed. Any disclosure, copying or 
distribution of the information by anyone else is strictly prohibited.
If you have received this document in error, please notify us promptly by 
responding to this e-mail. Thank you.

