Re: GSOC 2009 - Unicode Normalization Project

Michael Glavassevich Sun, 29 Mar 2009 23:28:29 -0700

Hi Richard,

I've now had a chance to read over your proposal. Looks really good and
detailed. I don't think your exam period should be an issue. I expect many
other students will also be in class and have exams during GSoC, and I have
mentored a few students in this situation in previous years. It looks like
you have a good plan for handling that.


Some comments which should help you improve your proposal:

In the scope section:

> In addition, this project will only aim to normalize characters within
> the Unicode standard; it will not normalize characters from legacy
> encodings.

All XML characters are Unicode. If you were thinking of other character
encodings besides UTF-*, these all get converted to Java chars on input so
essentially Xerces is always working on UTF-16 and thus the normalization
checker / normalizer will always see a "Unicode encoding form".
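
To illustrate the point, here is a minimal sketch. It assumes the JDK's
java.text.Normalizer as a stand-in for whatever normalization library the
project ends up using (e.g. ICU), and the class and method names are made
up for this example. Bytes in a legacy encoding are decoded to a Java
String first, so the normalization check only ever sees UTF-16:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

// Hypothetical helper, not part of Xerces.
class DecodedInputCheck {

    // Decode the raw bytes into a Java String (UTF-16 code units), then
    // test whether the decoded text is in Normalization Form C.
    static boolean isNfcAfterDecoding(byte[] raw, Charset charset) {
        String decoded = new String(raw, charset);
        return Normalizer.isNormalized(decoded, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // 0xE9 is "e with acute" in ISO-8859-1; it decodes to U+00E9,
        // which is already composed.
        byte[] latin1 = { (byte) 0xE9 };
        System.out.println(
            isNfcAfterDecoding(latin1, StandardCharsets.ISO_8859_1)); // prints true

        // "e" followed by U+0301 (combining acute) is not in NFC,
        // regardless of which encoding it arrived in.
        byte[] utf8 = "e\u0301".getBytes(StandardCharsets.UTF_8);
        System.out.println(
            isNfcAfterDecoding(utf8, StandardCharsets.UTF_8)); // prints false
    }
}
```

The check itself never needs to know the original encoding; only the
decoding step does.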

In the approach section:

> Since the code between these functions can mostly be shared, the code can
> be implemented as a single XNI component. This allows it to be easily
> plugged into the pipeline of any XML parser or called by the DOM parser
> when necessary.

Probably something you've already realized but worth clarifying... The
pipeline (XMLParserConfiguration [1]) is shared between the SAX and DOM
(and perhaps one day StAX) parsers, so these features equally apply to the
existing SAX XMLReader, JAXP SAXParser and DocumentBuilder. There's already
a standard SAX feature defined for normalization checking [2]. We should
probably define a Xerces-specific feature URI to cover the normalization
function, which could be set on the SAX parser, similar to the parameter
defined in DOM Level 3 Core / Load & Save. For a DOM in memory the
normalizing / normalization checking functions would be invoked by setting
the parameters on the DOMConfiguration and calling normalizeDocument(). In
addition to plugging in the XNI component here it would also involve
updating the DOM with the normalized text. And when a DOM is loaded with an
LSParser, if the LSInput.certifiedText [3] flag is true, I believe the
intention is that normalization processing is skipped, so we should have
some way to bypass the normalization component (e.g. excluding it from the
pipeline) when the input claims to be certified.
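
As a rough sketch of how a caller would request these behaviours through
the standard APIs: the SAX feature URI below is the standard one from [2],
and "normalize-characters" is the DOM Level 3 Core parameter. The class
and method names are made up for this example, and whether a given parser
actually accepts either setting is implementation-dependent, so the code
only reports what was accepted instead of assuming support:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.DOMConfiguration;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

// Hypothetical helper, not part of Xerces.
class NormalizationSettings {

    static final String SAX_CHECKING =
        "http://xml.org/sax/features/unicode-normalization-checking";

    // Try to turn on normalization checking for a SAX parser.
    // Returns false if the parser does not recognize or support it.
    static boolean enableSaxChecking(XMLReader reader) {
        try {
            reader.setFeature(SAX_CHECKING, true);
            return true;
        } catch (SAXException e) { // not recognized / not supported
            return false;
        }
    }

    // Ask for character normalization on an in-memory DOM, then run
    // normalizeDocument(). Returns whether the parameter was accepted.
    static boolean requestDomNormalization(Document doc) {
        DOMConfiguration config = doc.getDomConfig();
        boolean supported =
            config.canSetParameter("normalize-characters", Boolean.TRUE);
        if (supported) {
            config.setParameter("normalize-characters", Boolean.TRUE);
        }
        doc.normalizeDocument();
        return supported;
    }

    // Exercise both paths with the default JAXP parsers.
    static String report() {
        try {
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document doc = dbf.newDocumentBuilder().parse(
                new InputSource(new StringReader("<r>e\u0301</r>")));
            return "SAX checking accepted: " + enableSaxChecking(reader)
                + ", DOM normalize-characters accepted: "
                + requestDomNormalization(doc);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(report());
    }
}
```

For the Load & Save case, the corresponding standard call is
LSInput.setCertifiedText(true) on the input handed to the LSParser.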

In the deliverables section:

> Source code and makefiles for an XNI component that provides Unicode
> normalization functionality

We haven't used makefiles in ages. Xerces builds with Ant. The build.xml
file will probably need a few updates (e.g. to include an ICU jar).

In the Community Interaction section:

> I have already joined the Xerces Developer mailing list and will use this
> as a way of providing weekly updates on the status of my project to other
> members. In addition I will report to my mentor directly every other day.

It would be great if we could have project discussion out in the open as
much as possible. That's part of the experience of developing in Apache.

Thanks.

[1] http://xerces.apache.org/xerces2-j/xni-config.html
[2] http://xerces.apache.org/xerces2-j/features.html#unicode-normalization-checking
[3] http://www.w3.org/TR/2004/REC-DOM-Level-3-LS-20040407/load-save.html#LS-LSInput-certifiedText

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: [email protected]

Michael Glavassevich/Toronto/i...@ibmca wrote on 03/29/2009 09:55:49 AM:

> Hi Richard,
>
> Richard Kelly <[email protected]> wrote on 03/29/2009 09:49:03 AM:
>
> > Hi,
> >
> > I've got my project proposal up.  Please let me know if you can suggest
> > any improvements.
>
> Great. I'll take a look today.
>
> > The URL is:
> > http://wiki.apache.org/general/SoC2009/RichardKelly-Xerces-NormalizationProposal
> >
> > Thanks!
> > Richard Kelly
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [email protected]
> > For additional commands, e-mail: [email protected]
>
> Thanks.
>
> Michael Glavassevich
> XML Parser Development
> IBM Toronto Lab
> E-mail: [email protected]
