Hi everyone,

I've made some progress on my character normalization, and I
would like to get some feedback on my work to ensure I'm on the
right path.

I've uploaded the current state of my patches on JIRA [1].

CharacterNormalizer.java is the new component that does the actual work.
CharacterNormalizer.patch is all the changes to existing files that I
needed to make.

The relevant SAX [2] and DOM [3][4] character normalization features
do appear to be working as intended with these changes (except for the
tasks mentioned below).  I've implemented it as an XNI component, as we
discussed, and use two Xerces features to control this component and
determine whether or not it gets added to the pipeline.
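
For anyone who wants to try the DOM side, here is a minimal sketch of how a
caller might request the two DOM Level 3 parameters from [3] and [4]. It is
not taken from the patch: the class and method names (NormalizationParams,
trySet, newDocument) are made up for illustration, and it only uses the
standard DOMConfiguration API. An implementation without normalization
support is expected to refuse the value true, so the sketch asks first
instead of setting the parameter blindly.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMConfiguration;
import org.w3c.dom.Document;

/** Hypothetical helper for illustration only; not part of the patch. */
final class NormalizationParams {

    /**
     * Turns the named DOM parameter on if the implementation supports it.
     * Returns true if the parameter was set, false if it was refused.
     */
    static boolean trySet(DOMConfiguration config, String name) {
        if (config.canSetParameter(name, Boolean.TRUE)) {
            config.setParameter(name, Boolean.TRUE);
            return true;
        }
        return false;
    }

    /** Creates an empty document with whatever JAXP parser is on the classpath. */
    static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A caller would then do something like
trySet(doc.getDomConfig(), "normalize-characters") and
trySet(doc.getDomConfig(), "check-character-normalization") before calling
doc.normalizeDocument().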

Still on my to-do list:
- DOM Level 3 normalizeDocument() and Node.normalize() methods:
These methods don't use the pipeline, so I am planning to add code to
call the component directly from within them.
- Multiple character data stream events are not handled correctly:
Since Unicode characters can be larger than 16 bits, they may get split
up across multiple 'characters' events.  If this happens, the
character may not be normalized correctly.  To avoid this, I
plan to use a buffer within my component to keep track of characters
that span these events.
- A comprehensive set of tests to check that the features work as
described in the standards.  I've done basic testing for a number of
cases (all of which passed), but obviously we would want something
more comprehensive, along with some performance testing.
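
To make the buffering item above more concrete, here is a rough sketch of the
idea. It is not the code in CharacterNormalizer.java: the class name
ChunkedNormalizer and its methods are hypothetical, and java.text.Normalizer
with NFC stands in for whatever normalization the component really performs.
It only covers a surrogate pair split across two calls; a combining sequence
that spans two calls would need the buffer to hold back more than one char.

```java
import java.text.Normalizer;

/** Hypothetical sketch; not the CharacterNormalizer from the patch. */
final class ChunkedNormalizer {

    // High surrogate held back from the end of the previous chunk, or 0 if none.
    private char pendingHigh = 0;

    /**
     * Normalizes one 'characters' chunk and returns the text that is safe to
     * pass down the pipeline. A trailing high surrogate is held back until
     * the next call supplies its low surrogate.
     */
    String characters(char[] ch, int start, int length) {
        StringBuilder sb = new StringBuilder(length + 1);
        if (pendingHigh != 0) {
            // Rejoin the half of the pair left over from the previous call.
            sb.append(pendingHigh);
            pendingHigh = 0;
        }
        sb.append(ch, start, length);

        int last = sb.length() - 1;
        if (last >= 0 && Character.isHighSurrogate(sb.charAt(last))) {
            // The chunk ends in the middle of a supplementary character.
            pendingHigh = sb.charAt(last);
            sb.setLength(last);
        }
        return Normalizer.normalize(sb, Normalizer.Form.NFC);
    }

    /** Returns anything still held back; call when the character data ends. */
    String flush() {
        if (pendingHigh == 0) {
            return "";
        }
        String rest = String.valueOf(pendingHigh);
        pendingHigh = 0;
        return rest;
    }
}
```

The component would call characters() for each event and flush() before
forwarding any non-character event such as an end tag.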

If anyone would like to take a look and see if there are any obvious
problems, that would be great.

thanks,
Richard

[1] https://issues.apache.org/jira/browse/XERCESJ-1383
[2] http://www.saxproject.org/apidoc/org/xml/sax/package-summary.html#package_description
[3] http://www.w3.org/TR/DOM-Level-3-Core/core.html#parameter-check-character-normalization
[4] http://www.w3.org/TR/DOM-Level-3-Core/core.html#parameter-normalize-characters
