Hi everyone, I've made some progress on my character normalization, and I would like to get some feedback on my work to ensure I'm on the right path.
I've uploaded the current state of my patches to JIRA [1]. CharacterNormalizer.java is the new component that does the actual work. CharacterNormalizer.patch contains all the changes I needed to make to existing files. With these changes, the relevant SAX [2] and DOM [3][4] character normalization features appear to be working as intended (except for the tasks mentioned below). I've implemented it as an XNI component as we discussed, and I use two Xerces features to control this component and determine whether or not it gets added to the pipeline.

Still on my to-do list:

- DOM Level 3 normalizeDocument() and Node.normalize() functions: These functions don't use the pipeline, so I am planning to add code that calls the component directly from within them.

- Multiple character data stream events are not handled correctly: Since a Unicode character can take more than 16 bits (a surrogate pair in Java), it may get split across multiple 'characters' events. If this happens, the character may not be normalized correctly. To avoid this, I plan to use a buffer within my component to keep track of characters that overlap these events.

- A comprehensive set of tests to check that the features work as described in the standards. I've done basic testing for a number of cases (all of which passed), but obviously we would want something more comprehensive, as well as some performance testing.

If anyone would like to take a look and see if there are any obvious problems, that would be great.
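To make the buffering idea concrete, here is a minimal sketch of what I have in mind. It is not the code in the patch: ChunkedNormalizerSketch, feed() and flush() are hypothetical names used only for illustration, and it assumes NFC via java.text.Normalizer. It holds back the last base character (or an unpaired high surrogate) together with any combining marks after it, since the next chunk might still continue that sequence. The boundary test is deliberately simplified; cases such as Hangul jamo would need a stricter check (ICU4J's Normalizer2.hasBoundaryBefore might be one option).

```java
import java.text.Normalizer;

/** Hypothetical helper, not part of Xerces: NFC-normalizes text that arrives in chunks. */
public class ChunkedNormalizerSketch {
    // Text held back because a later chunk might still combine with it.
    private final StringBuilder pending = new StringBuilder();

    /** Accepts one chunk and returns the normalized text that is safe to emit now. */
    public String feed(String chunk) {
        pending.append(chunk);
        int cut = safeCut(pending);
        String ready = Normalizer.normalize(pending.substring(0, cut), Normalizer.Form.NFC);
        pending.delete(0, cut);
        return ready;
    }

    /** Call at the end of the character data to emit whatever is still held back. */
    public String flush() {
        String rest = Normalizer.normalize(pending, Normalizer.Form.NFC);
        pending.setLength(0);
        return rest;
    }

    // Returns the index where the last non-mark code point starts. Everything
    // before that index is emitted; the rest waits for the next chunk.
    // Simplified assumption: only combining marks continue a sequence.
    private static int safeCut(CharSequence s) {
        int i = s.length();
        while (i > 0) {
            int cp = Character.codePointBefore(s, i);
            i -= Character.charCount(cp);
            if (!isMark(cp)) {
                return i; // an unpaired high surrogate also lands here and is held back
            }
        }
        return 0;
    }

    private static boolean isMark(int cp) {
        int type = Character.getType(cp);
        return type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }
}
```

In the component, each 'characters' event would go through feed(), and flush() would be called before forwarding any non-character event (for example startElement or endElement), so that nothing stays buffered across markup.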
Thanks,
Richard

[1] https://issues.apache.org/jira/browse/XERCESJ-1383
[2] http://www.saxproject.org/apidoc/org/xml/sax/package-summary.html#package_description
[3] http://www.w3.org/TR/DOM-Level-3-Core/core.html#parameter-check-character-normalization
[4] http://www.w3.org/TR/DOM-Level-3-Core/core.html#parameter-normalize-characters
