[P&C] Low-level internationalization, XML deserialization, IRI or URI, IRI normalization

Marcin Hanclik Sat, 25 Jul 2009 02:12:45 -0700

Hi Marcos, All,

Regarding the usage of IRIs in the widget configuration document, I do not know 
which specification is responsible for mandating IRI normalization.
It is possible that I simply have not yet found the existing explanation 
of this issue, so if you know of one, I would be grateful for a pointer.


These are more details.

The P&C spec mixes the targets of the grammars (or low-level format 
specifications) it operates on.
E.g.
the sections about Zip archive operate on bytes
http://www.w3.org/TR/widgets/#zip-archive
http://www.w3.org/TR/widgets/#version-of-zip-needed-to-extract-a-file-

and
zip-rel-path grammar
http://www.w3.org/TR/widgets/#zip-rel-path
operates on characters, not bytes (it may not be fully clear from the P&C text).

XML Fifth Edition refers only to the URI specification; it does not know about IRIs.

WUA must support XML and UTF-8:
http://www.w3.org/TR/widgets/#dependencies-on-other-specifications-and

The configuration document is only required to be XML:
http://www.w3.org/TR/widgets/#configuration-document
and its encoding may be virtually any that is registered with IANA (my 
assumption).

So we can have the following situation:
The WUA that I develop widgets for has a very interesting feature whose IRI 
is really international (Polish in this case):

http://example.com/ŁódzkiŚpiewnikŹdźbłowy

I.e. the IRI contains characters outside of the US-ASCII character set.
Now suppose I do not have a UTF-8 capable editor at hand, so I convert the IRI 
to a URI as described in
http://tools.ietf.org/html/rfc3987#section-3.1, Step 2, and
I write the following config.xml in the legacy US-ASCII encoding:

<?xml version="1.0" encoding="us-ascii"?>
<widget …>
…
<feature 
name="http://example.com/%C5%81%C3%B3dzki%C5%9Apiewnik%C5%B9d%C5%BAb%C5%82owy" 
/>
…
</widget>
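
As an illustration of the conversion I mean, here is a minimal Python sketch 
of Step 2 of RFC 3987 section 3.1. The function name iri_to_uri is my own, and 
the sketch simplifies the RFC: it percent-encodes every non-ASCII character 
(the RFC restricts this to ucschar and iprivate), applies NFC unconditionally, 
and does not treat the host part specially.

```python
import unicodedata


def iri_to_uri(iri: str) -> str:
    """Percent-encode each non-ASCII character of an IRI as its UTF-8 octets.

    Hypothetical helper, simplified from RFC 3987 section 3.1.
    """
    # Assumption: normalizing to NFC first is acceptable for this input.
    normalized = unicodedata.normalize("NFC", iri)
    pieces = []
    for ch in normalized:
        if ord(ch) < 0x80:
            # ASCII characters are copied unchanged.
            pieces.append(ch)
        else:
            # One %HH triplet per UTF-8 octet, upper-case hex digits.
            pieces.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
    return "".join(pieces)


# The feature IRI from this mail, written with escapes to keep the source ASCII.
feature_iri = "http://example.com/\u0141\u00f3dzki\u015apiewnik\u0179d\u017ab\u0142owy"
print(iri_to_uri(feature_iri))
```

For the feature IRI above this yields the value of the name attribute used in 
the config.xml.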

http://tools.ietf.org/html/rfc3987#section-3.2 provides a method to convert URI 
to IRI.
However, I am not sure whether this conversion is mandated in P&C, since P&C 
just says that e.g. the name attribute is an IRI.
I am not sure whether the value must have IRI syntax already in config.xml 
(not possible in my case, since I use US-ASCII only) or only at a later stage.
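
For reference, the reverse direction could look roughly like the Python 
sketch below. It is a conservative simplification of RFC 3987 section 3.2, not 
a full implementation: it only turns percent-encoded octets into characters 
when they form valid non-ASCII UTF-8, leaves every percent-encoded ASCII octet 
encoded, and skips the RFC's check for characters that are inappropriate in 
IRIs. The name uri_to_iri is my own.

```python
import re

# A run of one or more consecutive %HH triplets.
_PCT_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def uri_to_iri(uri: str) -> str:
    """Decode percent-encoded UTF-8 sequences; keep everything else encoded.

    Hypothetical helper, simplified from RFC 3987 section 3.2.
    """

    def decode_run(match: re.Match) -> str:
        octets = bytes.fromhex(match.group(0).replace("%", ""))
        # Invalid octets survive as lone surrogates U+DC80..U+DCFF.
        text = octets.decode("utf-8", errors="surrogateescape")
        out = []
        for ch in text:
            code = ord(ch)
            if code < 0x80:
                # Keep ASCII octets percent-encoded (hex digits upper-cased).
                out.append(f"%{code:02X}")
            elif 0xDC80 <= code <= 0xDCFF:
                # Not legal UTF-8: re-encode the original octet.
                out.append(f"%{code - 0xDC00:02X}")
            else:
                out.append(ch)
        return "".join(out)

    return _PCT_RUN.sub(decode_run, uri)


encoded = "http://example.com/%C5%81%C3%B3dzki%C5%9Apiewnik%C5%B9d%C5%BAb%C5%82owy"
print(uri_to_iri(encoded))
```

Whether a WUA is expected to run such a step on attribute values is exactly 
the open question.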

Percent encoding is allowed in IRIs:
http://tools.ietf.org/html/rfc3987#section-2.2
and
"Terminals in the ABNF are characters, not bytes."

Therefore it seems possible that the above config.xml, when parsed by XML- and 
UTF-8-supporting WUA, will refer to a feature whose IRI would be

http://example.com/%C5%81%C3%B3dzki%C5%9Apiewnik%C5%B9d%C5%BAb%C5%82owy

on the character level. Then, this valid IRI has to be checked for equivalence 
with

http://example.com/ŁódzkiŚpiewnikŹdźbłowy

based on the algorithm specified in 
http://tools.ietf.org/html/rfc3987#section-5.1
and
http://tools.ietf.org/html/rfc3987#section-5.3.1

http://tools.ietf.org/html/rfc3987#section-5.3.2, specifically section 5.3.2.3 
mentions percent-encoding normalization.
I am not sure whether DOM3Core Load&Save mechanisms perform such normalization 
(as also below).
P&C does not specify it.
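
To make the difference concrete, the Python sketch below contrasts simple 
string comparison (section 5.3.1) with a comparison after one possible 
percent-encoding normalization. Comparing both values in URI form with 
upper-case hex digits is my own choice for the illustration, not something 
P&C or RFC 3987 prescribes, and comparison_key is a hypothetical name.

```python
import re
import unicodedata


def comparison_key(iri: str) -> str:
    """Map an IRI to a URI-form key so that both spellings compare equal.

    Hypothetical helper; only percent-encoding differences are handled.
    """
    nfc = unicodedata.normalize("NFC", iri)
    # Percent-encode non-ASCII characters as UTF-8 octets.
    encoded = "".join(
        ch if ord(ch) < 0x80
        else "".join(f"%{octet:02X}" for octet in ch.encode("utf-8"))
        for ch in nfc
    )
    # Upper-case the hex digits of triplets that were already present.
    return re.sub(r"%[0-9A-Fa-f]{2}", lambda m: m.group(0).upper(), encoded)


from_config = "http://example.com/%C5%81%C3%B3dzki%C5%9Apiewnik%C5%B9d%C5%BAb%C5%82owy"
from_wua = "http://example.com/\u0141\u00f3dzki\u015apiewnik\u0179d\u017ab\u0142owy"

# Simple string comparison: the two spellings differ.
print(from_config == from_wua)
# After normalization: the two spellings match.
print(comparison_key(from_config) == comparison_key(from_wua))
```

Without some step of this kind, a WUA comparing the strings character by 
character would treat the two feature names as different.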

P&C says:
"An attribute defined as containing a valid IRI. A valid IRI is one that 
matches the IRI  token of the [RFC3987] specification."
Again, does this refer to the syntax as written in config.xml (i.e. impossible 
at the byte level in my US-ASCII case) or to some later stage?

DOM3Core http://www.w3.org/TR/DOM-Level-3-Core/core.html says
"A solution for loading a Document and saving it persistently is proposed in 
[DOM Level 3 Load and Save]."

DOM3LS http://www.w3.org/TR/DOM-Level-3-LS/load-save.html will normalize the 
entities AFAIK, but probably will not normalize percent-encoded characters in 
URI/IRI.
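
The same behaviour can be observed with an ordinary XML parser. The Python 
sketch below uses xml.etree.ElementTree from the standard library, not a DOM3 
Load&Save implementation, so it only illustrates the general point. The 
shortened feature name and the missing widget namespace are simplifications 
of mine.

```python
import xml.etree.ElementTree as ET

# US-ASCII document: one character reference and one percent-encoded sequence.
CONFIG = (
    b'<?xml version="1.0" encoding="us-ascii"?>'
    b'<widget><feature name="http://example.com/&#x141;%C3%B3dzki"/></widget>'
)

root = ET.fromstring(CONFIG)
name = root.find("feature").get("name")

# The character reference &#x141; has been expanded by the XML parser,
# while the percent-encoded octets %C3%B3 are passed through untouched.
print(ascii(name))
```

So XML-level processing resolves the character reference but leaves the 
percent-encoding for some other layer to deal with.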

Proposal

http://www.w3.org/TR/REC-xml-names/#iri-use says:
"Because of the risk of confusion between URIs that would be equivalent if 
dereferenced, the use of %-escaped characters in namespace names is strongly 
discouraged."

So maybe P&C shall state something similar, e.g.

"Because of the risk of confusion between IRIs that would be equivalent if 
dereferenced, the use of %-escaped characters in feature names is strongly 
discouraged."

This would keep percent-encoded IRIs out of the configuration document, but 
it would require the author of the configuration document to use a UTF-8 
capable editor (this may be too hard a requirement; it is just a proposal).

Alternatively, we could specify in P&C that the attributes currently 
specified as being IRI shall actually be "IRI or URI", depending on the 
encoding of the config.xml.

A third option would be to say something about IRI/URI normalization.

More comments:

The part of
http://www.w3.org/html/wg/href/draft.html#parsing-urls
namely:
"How does this compare to just parsing using the IRI grammar of RFC 3987?"
makes me think that the problem (I assume my problem and that of Web 
addresses are similar) is not yet fully solved in any spec.
I apologize if I have overlooked something.

The latest draft for IRI is this one:
http://tools.ietf.org/html/draft-duerst-iri-bis-06
and it is being discussed also in W3C, see e.g. very recent comments from Anne 
at
http://lists.w3.org/Archives/Public/public-iri/2009Jul/

These are the documents that could help more:
http://www.w3.org/International/articles/idn-and-iri/
http://www.w3.org/html/wg/href/draft-ietf.html (seems to be just a newer 
version of the above draft)

Please let me know what you think.
Thanks.

Kind regards,
Marcin

________________________________________

Access Systems Germany GmbH
Essener Strasse 5  |  D-46047 Oberhausen
HRB 13548 Amtsgericht Duisburg
Geschaeftsfuehrer: Michel Piquemal, Tomonori Watanabe, Yusuke Kanda

www.access-company.com

