[widgets] white space handling

Cyril Concolato Thu, 17 Dec 2009 04:55:30 -0800

Hi Widget addicts,

While re-reading the spec, I wondered why white space handling differs 
between the P&C spec and the XML spec.


P&C defines:
* "space characters" as: U+0020, U+0009, U+000A, U+000B, U+000C, U+000D
* "Unicode white space characters" as: U+0009-U+000D, U+0020, U+0085, U+00A0, 
U+1680, U+180E, U+2000-U+200A, U+2028, U+2029, U+202F, U+205F, U+3000
* "control characters" as: U+0000-U+001F, U+007F
* "forbidden characters" as: control characters and U+003C, U+003E, U+003A, 
U+0022, U+002F, U+005C, U+007C, U+003F, U+002A, U+005E, U+0060, U+007B, U+007D, U+0021.
"space characters" are used in "Rule for Getting a Single Attribute Value", "Rule for Getting a List of 
Keywords From an Attribute", "Rule for Parsing a Non-negative Integer", "algorithm to derive the user agent 
locales" and ZIP handling.
"Unicode white space characters" are used only in "Rule for Getting Text Content 
with Normalized White Space"
"control characters" are used only in "forbidden characters", and "forbidden 
characters" are used only in ZIP processing.

XML defines "white space" as: U+0020, U+0009, U+000A, U+000D
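
To make the differences easier to see, here is a minimal Python sketch that builds the three sets from the code points listed above and prints what each P&C definition adds. The helper names (expand, fmt) and the constant names are mine, for illustration only; they do not come from either spec.

```python
def expand(*items):
    """Build a set of code points from ints and inclusive (lo, hi) ranges."""
    out = set()
    for item in items:
        if isinstance(item, tuple):
            lo, hi = item
            out.update(range(lo, hi + 1))
        else:
            out.add(item)
    return out


def fmt(code_points):
    """Render code points in U+XXXX notation, sorted."""
    return [f"U+{cp:04X}" for cp in sorted(code_points)]


# Code points copied from the definitions quoted in this message.
XML_WHITE_SPACE = expand(0x20, 0x09, 0x0A, 0x0D)
PC_SPACE = expand(0x20, 0x09, 0x0A, 0x0B, 0x0C, 0x0D)
PC_UNICODE_WHITE_SPACE = expand(
    (0x09, 0x0D), 0x20, 0x85, 0xA0, 0x1680, 0x180E,
    (0x2000, 0x200A), 0x2028, 0x2029, 0x202F, 0x205F, 0x3000,
)

# What P&C "space characters" add on top of XML "white space".
extra_space = PC_SPACE - XML_WHITE_SPACE
# What "Unicode white space characters" add on top of "space characters".
extra_unicode = PC_UNICODE_WHITE_SPACE - PC_SPACE

print("space characters beyond XML:", fmt(extra_space))
print("Unicode white space beyond space characters:", fmt(extra_unicode))
```

Each set contains the previous one, so the only code points that separate P&C "space characters" from XML "white space" are U+000B and U+000C.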

Given that, I have the following questions/remarks:

- Why do you define "control characters" separately? Couldn't you list their 
code points directly in "forbidden characters"? This would simplify the spec 
and make it easier to understand.

- Could you rename "forbidden characters" to "ZIP forbidden characters"? This 
would clearly indicate in which area they are forbidden and why they are defined.

- Why do the definitions of P&C "space characters" and "Unicode white space characters" 
differ from the XML "white space" definition?

For "Unicode white space characters", I can understand the difference, since the term is used only 
in the "Rule for Getting Text Content with Normalized White Space", which first applies XML 
parsing and DOM3 textContent behavior, and then applies additional P&C-defined behavior. But still, I'm 
wondering: is this difference really needed? If so, can you add a note explaining the rationale and 
how it differs from basic XML processing?

For "space characters", why did you add U+000B and U+000C?

- Ignoring U+000B and U+000C, the "Rule for Getting a Single Attribute Value" seems to me 
to be already defined in XML as "Attribute-Value 
Normalization" (http://www.w3.org/TR/xml/#AVNormalize). I can understand that you want a 
self-contained spec, but you should at least indicate that the behavior is the same as basic XML 
processing.
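
As a rough illustration of that last point, here is a Python sketch. It rests on two assumptions that are my reading rather than quotations from either spec: that the P&C rule replaces runs of "space characters" with a single U+0020 and strips them from both ends, and that the XML side of the comparison is the normalization applied to non-CDATA attribute types. The function name normalize is hypothetical.

```python
XML_WS = "\u0020\u0009\u000A\u000D"
PC_SPACE = XML_WS + "\u000B\u000C"


def normalize(value, space_chars):
    """Collapse runs of the given space characters to one U+0020 and trim.

    Assumed behavior, not a quotation of either spec.
    """
    # Map every space character to U+0020 first.
    table = {ord(c): " " for c in space_chars}
    mapped = value.translate(table)
    # Dropping empty pieces collapses runs and removes leading/trailing spaces.
    return " ".join(part for part in mapped.split(" ") if part)


plain = "  en \t\n fr  "
with_vt = "en\u000Bfr"

# Without U+000B or U+000C in the input, both character sets give the same result.
print(normalize(plain, XML_WS) == normalize(plain, PC_SPACE))
# With U+000B in the input, only the P&C set treats it as a separator.
print(normalize(with_vt, XML_WS) == normalize(with_vt, PC_SPACE))
```

Under those assumptions, the two only diverge on input containing U+000B or U+000C, which is why I think a note pointing to the XML behavior would be enough.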

Best regards,

Cyril
--
Cyril Concolato
Maître de Conférences/Associate Professor
Groupe Mutimedia/Multimedia Group
Telecom ParisTech
46 rue Barrault
75 013 Paris, France
http://concolato.blog.telecom-paristech.fr/
