Hi Richard,

Richard Kelly <[email protected]> wrote on 06/23/2009 12:00:18 PM:

> Hi all,
>
> I'm finally finishing my exams this week, so I'll be able to dedicate
> more time to this project.  I thought I'd give an update of where I'm
> at.
> So far, I've done this:
> - Created a character normalization component that performs Unicode
> normalization.
> - Modified XML11Configuration to handle the new features and to add
> and remove the component from the pipeline when appropriate.
> - Modified AbstractSAXParser to handle the SAX character normalization
> flags.
> - Created basic test files to ensure the features are working as
> expected.
> - Extended the character normalization component to deal with
> composing characters.
> - Updated the XML messages for character normalization errors.
> - Built the ICU4J component and updated build.xml to use it.

This sounds really good. Looking forward to seeing your first patch.

> At the moment, I'm trying to map the 'relevant constructs' [1] in the
> XML specification to relevant Document Handler events.  These
> constructs consist of:
>    1.  The replacement text of all parsed entities
>    2.  All text matching, in context, one of the following
> productions:  CData, CharData, content, Name, Nmtoken.
>
> After looking through the XML specification and correlating the above
> with DocumentHandler functions [2], I've interpreted this to mean:
> - normalize the text of 'characters' events (since this event matches
> replacement text, CData, CharData and content productions)
> - normalize QNames and XMLAttributes in any events where they occur
> (this matches most Name and Nmtoken productions)
> - normalize name parameters in doctypeDecl, startGeneralEntity,
> processingInstruction, and endGeneralEntity events (additional
> structures in which Name productions occur)

Possibly more than that. I think normalization applies to all content in
the document (including comments) with an additional requirement "that none
of the relevant constructs listed above begins (after character references
are expanded) with a composing character as defined by B Definitions for
Character Normalization".
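
To make the two parts of that check concrete, here is a rough sketch in plain Java. It is only an illustration, not what Xerces or your component does: the class and method names are made up, it uses the JDK's java.text.Normalizer rather than ICU4J, and it approximates "composing character" by the Unicode combining-mark categories (Mn, Mc, Me), which is not the exact definition given in Appendix B of the XML 1.1 specification.

```java
import java.text.Normalizer;

// Hypothetical helper, for illustration only; not part of Xerces or XNI.
final class NormalizationCheck {
    private NormalizationCheck() {}

    /** Returns true if the text is already in Unicode Normalization Form C. */
    static boolean isNfc(CharSequence text) {
        return Normalizer.isNormalized(text, Normalizer.Form.NFC);
    }

    /**
     * ASSUMPTION: approximates "composing character" as any combining mark
     * (general category Mn, Mc or Me). The XML 1.1 definition is based on
     * canonical decompositions and combining classes, so this is only a
     * stand-in for the real test.
     */
    static boolean startsWithCombiningMark(CharSequence text) {
        if (text.length() == 0) {
            return false;
        }
        int first = Character.codePointAt(text, 0);
        int type = Character.getType(first);
        return type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }

    /** Combines both conditions for a single construct (a name, a text chunk, a comment). */
    static boolean passesCheck(CharSequence text) {
        return isNfc(text) && !startsWithCombiningMark(text);
    }
}
```

Each handler event that carries one of the relevant constructs would run its text through something like passesCheck and report an error when it returns false. Note that text delivered in several 'characters' chunks would need the check applied across chunk boundaries, which this sketch does not attempt.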

> If anyone can think of other events in which these productions are
> used, I would be most grateful if you could point them out.
>
> Thanks for all your assistance so far, it has been a great help.
> regards,
> Richard
>
>
> [1] http://www.w3.org/TR/xml11/#sec-normalization-checking
> [2] http://xerces.apache.org/xerces2-j/javadocs/xni/org/apache/xerces/xni/XMLDocumentHandler.html

Thanks.

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: [email protected]