Hi Richard,

I've now had a chance to read over your proposal. Looks really good and detailed. I don't think your exam period should be an issue. I expect many other students will also be in class and have exams during GSoC, and I have mentored a few in this situation in previous years. It looks like you have a good plan for handling that.
Some comments which should help you improve your proposal:

In the scope section:

> In addition, this project will only aim to normalize characters within the Unicode standard; it will not normalize characters from legacy encodings.

All XML characters are Unicode. If you were thinking of other character encodings besides UTF-*, these all get converted to Java chars on input, so essentially Xerces is always working on UTF-16 and thus the normalization checker / normalizer will always see a "Unicode encoding form".

In the approach section:

> Since the code between these functions can mostly be shared, the code can be implemented as a single XNI component. This allows it to be easily plugged into the pipeline of any XML parser or called by the DOM parser when necessary.

Probably something you've already realized but worth clarifying... The pipeline (XMLParserConfiguration [1]) is shared between the SAX and DOM (and perhaps one day StAX) parsers, so these features equally apply to the existing SAX XMLReader, JAXP SAXParser and DocumentBuilder. There's already a standard SAX feature defined for normalization checking [2]. We should probably define a Xerces-specific feature URI to cover the normalization function which could be set on the SAX parser, similar to the parameter defined in DOM Level 3 Core / Load & Save.

For a DOM in memory, the normalizing / normalization checking functions would be invoked by setting the parameters on the DOMConfiguration and calling normalizeDocument(). In addition to plugging in the XNI component here, it would also involve updating the DOM with the normalized text.

And when a DOM is loaded with an LSParser, if the LSInput.certifiedText [3] flag is true, I believe the intention is that normalization processing is skipped, so we should have some way to bypass the normalization component (e.g. excluding it from the pipeline) when the input claims to be certified.
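
To make the UTF-16 point and the certified-text bypass a bit more concrete, here is a rough sketch of what the check could look like on a char[] range, the form in which character data reaches a SAX- or XNI-style callback. Everything in it is an assumption for illustration only: the class and method names are hypothetical and not part of Xerces, and it uses java.text.Normalizer from the JDK as a stand-in for ICU. It also only tests each chunk for NFC, which is a simplification of the checking described in XML 1.1, and it ignores the case where a combining sequence is split across two callbacks.

```java
import java.text.Normalizer;

/** Hypothetical helper for illustration; not part of Xerces or XNI. */
final class NormalizationCheckSketch {

    private NormalizationCheckSketch() {}

    /** Returns true if the given UTF-16 range is already in NFC. */
    static boolean isNfc(char[] ch, int start, int length) {
        return Normalizer.isNormalized(new String(ch, start, length), Normalizer.Form.NFC);
    }

    /** Returns the NFC form of the given UTF-16 range. */
    static String toNfc(char[] ch, int start, int length) {
        return Normalizer.normalize(new String(ch, start, length), Normalizer.Form.NFC);
    }

    /**
     * Decides whether a chunk passes the check. When the input claims to be
     * certified (compare LSInput.certifiedText), the check is skipped entirely.
     */
    static boolean passes(boolean certifiedText, char[] ch, int start, int length) {
        return certifiedText || isNfc(ch, start, length);
    }

    public static void main(String[] args) {
        // "cafe" followed by U+0301 COMBINING ACUTE ACCENT (decomposed form).
        char[] text = "cafe\u0301".toCharArray();
        System.out.println("Already NFC? " + isNfc(text, 0, text.length));
        System.out.println("Passes when certified? " + passes(true, text, 0, text.length));
        System.out.println("Length after NFC: " + toNfc(text, 0, text.length).length());
    }
}
```

In a real component the same decision would more likely be made once, by leaving the component out of the pipeline for certified input, rather than per chunk as in this sketch.
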
In the deliverables section:

> Source code and makefiles for an XNI component that provides Unicode normalization functionality

We haven't used makefiles in ages. Xerces builds with Ant. The build.xml file will probably need a few updates (e.g. to include an ICU jar).

In the Community Interaction section:

> I have already joined the Xerces Developer mailing list and will use this as a way of providing weekly updates on the status of my project to other members. In addition I will report to my mentor directly every other day.

Would be great if we could have project discussion out in the open as much as possible. Part of the experience of developing in Apache.

Thanks.

[1] http://xerces.apache.org/xerces2-j/xni-config.html
[2] http://xerces.apache.org/xerces2-j/features.html#unicode-normalization-checking
[3] http://www.w3.org/TR/2004/REC-DOM-Level-3-LS-20040407/load-save.html#LS-LSInput-certifiedText

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: [email protected]
E-mail: [email protected]

Michael Glavassevich/Toronto/i...@ibmca wrote on 03/29/2009 09:55:49 AM:

> Hi Richard,
>
> Richard Kelly <[email protected]> wrote on 03/29/2009 09:49:03 AM:
>
> > Hi,
> >
> > I've got my project proposal up. Please let me know if you can suggest any
> > improvements.
>
> Great. I'll take a look today.
>
> > The URL is:
> > http://wiki.apache.org/general/SoC2009/RichardKelly-Xerces-NormalizationProposal
> >
> > Thanks!
> > Richard Kelly
>
> Thanks.
