Hi Richard,

Richard Kelly <[email protected]> wrote on 07/12/2009 05:59:36 AM:
> Hi everyone,
>
> I've made some progress on my character normalization, and I
> would like to get some feedback on my work to ensure I'm on the
> right path.

I've had an opportunity to review your code. What you have so far is looking really good. Great work!

> I've uploaded the current state of my patches on JIRA [1].

I do have some suggestions for improvements which I'll attach to the JIRA issue.

> CharacterNormalizer.java is the new component that does the actual work.
> CharacterNormalizer.patch contains all the changes to existing files that I
> needed to make.
>
> The relevant SAX [2] and DOM [3][4] character normalization features
> do appear to be working as intended with these changes (except for the
> tasks mentioned below). I've implemented it as an XNI component as we
> discussed and use two Xerces features to control this component and
> determine whether or not it gets added to the pipeline.
>
> Still on my to do list:
> - DOM Level 3 normalizeDocument() and Node.normalize() functions:
> These functions don't use the pipeline so I am planning to add code to
> directly call the component from within these functions.
> - Multiple character data stream events are not handled correctly:
> Since Unicode characters can be larger than 16 bits, they may get split
> up across multiple calls to 'characters' events. If this happens the
> character may not be normalized correctly. In order to avoid this, I
> plan to use a buffer within my component to keep track of characters
> that overlap these events.
> - A comprehensive set of tests to check that the features work as
> described in the standards. I've done basic testing for a number of
> cases (which it passed successfully) but obviously we would want
> something more comprehensive and also do some performance testing.
>
> If anyone would like to take a look and see if there are any obvious
> problems, that would be great.
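
The buffering idea for characters split across 'characters' events could be sketched roughly as follows. This is only an illustration and is not taken from the patch: the class name ChunkedNormalizer and its methods are hypothetical, it assumes java.text.Normalizer (Java 6+) is available, and it assumes NFC is the target form.

```java
import java.text.Normalizer;

/**
 * Hypothetical helper (not the CharacterNormalizer from the patch).
 * Holds back a trailing high surrogate so that a supplementary
 * character split across two chunks is normalized as a whole.
 */
final class ChunkedNormalizer {

    // High surrogate left over from the previous chunk, or 0 if none.
    private char pendingHigh = 0;

    /** Normalizes one chunk to NFC, deferring a trailing high surrogate. */
    String characters(char[] ch, int start, int length) {
        StringBuilder sb = new StringBuilder(length + 1);
        if (pendingHigh != 0) {
            // Re-attach the surrogate held back from the previous call.
            sb.append(pendingHigh);
            pendingHigh = 0;
        }
        sb.append(ch, start, length);

        int last = sb.length() - 1;
        if (last >= 0 && Character.isHighSurrogate(sb.charAt(last))) {
            // The matching low surrogate may arrive in the next chunk.
            pendingHigh = sb.charAt(last);
            sb.setLength(last);
        }
        return Normalizer.normalize(sb, Normalizer.Form.NFC);
    }

    /** Returns any held-back char once the character data has ended. */
    String flush() {
        if (pendingHigh == 0) {
            return "";
        }
        String rest = String.valueOf(pendingHigh);
        pendingHigh = 0;
        return rest;
    }
}
```

Note that this sketch only covers surrogate pairs. A base character followed by a combining mark in the next chunk (for example "e" and then U+0301) would still be normalized separately, so a real component would probably need to hold back more than a single char.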
>
> thanks,
> Richard
>
> [1] https://issues.apache.org/jira/browse/XERCESJ-1383
> [2] http://www.saxproject.org/apidoc/org/xml/sax/package-summary.html#package_description
> [3] http://www.w3.org/TR/DOM-Level-3-Core/core.html#parameter-check-character-normalization
> [4] http://www.w3.org/TR/DOM-Level-3-Core/core.html#parameter-normalize-characters

Thanks.

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: [email protected]
E-mail: [email protected]
