Re: GSOC 2009 - Unicode Normalization Project

Michael Glavassevich Thu, 02 Apr 2009 07:35:27 -0700

Hi Richard,

The application deadline is fast approaching: April 3rd at 19:00 UTC. I
think that translates to 5 or 6 AM on Saturday in Australia. If you haven't
already, you should update your proposal on the official site [1]. I'm not
sure that you will be able to edit after the deadline.


Thanks.

[1] http://socghop.appspot.com/

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: [email protected]

Michael Glavassevich/Toronto/i...@ibmca wrote on 03/31/2009 09:58:05 AM:

> Hi Richard,
>
> Richard Kelly <[email protected]> wrote on 03/31/2009 02:44:03 AM:
>
> > Hi Michael,
> >
> > Thanks for your thorough review, I'll revise my proposal ASAP using
> > your feedback.
> >
> >
> > 2009/3/30 Michael Glavassevich <[email protected]>:
> > >
> > > All XML characters are Unicode. If you were thinking of other character
> > > encodings besides UTF-*, these all get converted to Java chars on input so
> > > essentially Xerces is always working on UTF-16 and thus the normalization
> > > checker / normalizer will always see a "Unicode encoding form".
> > >
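To illustrate the point above about everything arriving as Java chars,
here is a minimal sketch. The class and method names are hypothetical,
and java.text.Normalizer (Java 6+) is only a stand-in for the check, not
the Xerces component being discussed. It decodes the same character from
two different byte encodings and shows that the normalization check only
ever sees the resulting UTF-16 string.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

// Hypothetical illustration; not part of Xerces.
final class Utf16InputSketch {

    // Decode raw bytes roughly the way a parser's reader layer would:
    // after this step the data is UTF-16 regardless of the source encoding.
    static String decode(byte[] raw, Charset encoding) {
        return new String(raw, encoding);
    }

    // Stand-in normalization check using the JDK's NFC test.
    static boolean isNfc(CharSequence text) {
        return Normalizer.isNormalized(text, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // U+00E9 as one byte in ISO-8859-1 and as two bytes in UTF-8.
        byte[] latin1 = { (byte) 0xE9 };
        byte[] utf8 = { (byte) 0xC3, (byte) 0xA9 };

        String a = decode(latin1, StandardCharsets.ISO_8859_1);
        String b = decode(utf8, StandardCharsets.UTF_8);

        // Both inputs become the same Java string, so the checker does
        // not need to know what the original encoding was.
        System.out.println(a.equals(b));
        System.out.println(isNfc(a));
        // 'e' followed by U+0301 (combining acute) is not in NFC.
        System.out.println(isNfc("e\u0301"));
    }
}
```

The design consequence is that a checker written against CharSequence or
char[] needs no per-encoding logic at all.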
> >
> > Ok, dealing with a single encoding should make it easier. I think I got
> > mixed up when reading section 4.3.3 of the XML spec which mentions some
> > other encodings. [1]
> >
> > >
> > > Probably something you've already realized but worth clarifying... The
> > > pipeline (XMLParserConfiguration [1]) is shared between the SAX and DOM
> > > (and perhaps one day StAX) parsers, so these features equally apply to
> > > the existing SAX XMLReader, JAXP SAXParser and DocumentBuilder. There's
> > > already a standard SAX feature defined for normalization checking [2].
> > > We should probably define a Xerces-specific feature URI to cover the
> > > normalization function which could be set on the SAX parser, similar to
> > > the parameter defined in DOM Level 3 Core / Load & Save. For a DOM in
> > > memory the normalizing / normalization checking functions would be
> > > invoked by setting the parameters on the DOMConfiguration and calling
> > > normalizeDocument(). In addition to plugging in the XNI component here it
> > > would also involve updating the DOM with the normalized text. And when a
> > > DOM is loaded with an LSParser, if the LSInput.certifiedText [3] flag is
> > > true, I believe the intention is that normalization processing is
> > > skipped, so we should have some way to bypass the normalization component
> > > (e.g. excluding it from the pipeline) when the input claims to be
> > > certified.
> > >
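The entry points mentioned above could be exercised roughly as follows.
This is only a sketch with hypothetical class and method names. The SAX
feature URI and the DOM Level 3 parameter names ("normalize-characters",
"check-character-normalization") are the standard ones, but whether a
given parser accepts them is implementation-dependent, so the code probes
for support instead of assuming it. The Xerces-specific feature URI
proposed above does not exist yet and is not shown.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.DOMConfiguration;
import org.w3c.dom.Document;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSInput;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

// Hypothetical illustration; not part of Xerces.
final class NormalizationSwitchesSketch {

    // Standard SAX2 feature for normalization checking.
    static final String SAX_CHECKING =
            "http://xml.org/sax/features/unicode-normalization-checking";

    static XMLReader newReader() {
        try {
            return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        } catch (Exception e) {
            throw new IllegalStateException("no SAX parser available", e);
        }
    }

    // Returns true only if the reader accepted the feature.
    static boolean trySaxChecking(XMLReader reader) {
        try {
            reader.setFeature(SAX_CHECKING, true);
            return true;
        } catch (SAXException e) {
            // SAXNotRecognizedException or SAXNotSupportedException
            return false;
        }
    }

    static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (Exception e) {
            throw new IllegalStateException("no DOM builder available", e);
        }
    }

    // Sets a boolean DOMConfiguration parameter if the implementation allows it.
    static boolean tryDomParameter(Document doc, String name) {
        DOMConfiguration config = doc.getDomConfig();
        if (!config.canSetParameter(name, Boolean.TRUE)) {
            return false;
        }
        config.setParameter(name, Boolean.TRUE);
        return true;
    }

    // Builds an LSInput that claims its text is already certified (normalized).
    static LSInput certifiedInput(String xml) {
        try {
            DOMImplementationLS ls = (DOMImplementationLS)
                    DOMImplementationRegistry.newInstance().getDOMImplementation("LS");
            LSInput input = ls.createLSInput();
            input.setStringData(xml);
            input.setCertifiedText(true);
            return input;
        } catch (Exception e) {
            throw new IllegalStateException("no DOM LS implementation available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("SAX checking accepted: " + trySaxChecking(newReader()));
        Document doc = newDocument();
        System.out.println("normalize-characters accepted: "
                + tryDomParameter(doc, "normalize-characters"));
        System.out.println("check-character-normalization accepted: "
                + tryDomParameter(doc, "check-character-normalization"));
        doc.normalizeDocument(); // applies whatever parameters were accepted
        System.out.println("certified: " + certifiedInput("<a/>").getCertifiedText());
    }
}
```

A parser without normalization support would be expected to reject the
"true" settings, which is why each call reports a boolean.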
> >
> > I had one question about another class called XML11Char which lets you
> > check if a character is a valid XML 1.1 character. Should normalization
> > checks play any role in this validation?
>
> XML11Char and its counterpart XMLChar are used for checking
> well-formedness: the set of rules which all XML documents must conform to
> (otherwise they're not XML). Well-formedness checking will have logically
> occurred before the normalization checker / normalizer sees the data. I
> wouldn't expect that you would need to call any of these methods again in
> that context.
>
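As a rough illustration of why the two checks are separate, the sketch
below parses a small document whose text is made of valid XML characters
but is not in NFC. The class and method names are hypothetical, the JAXP
default parser stands in for Xerces, and java.text.Normalizer stands in
for the normalization checker. The parse succeeding shows the content is
well-formed; the normalization question is only asked afterwards.

```java
import java.io.StringReader;
import java.text.Normalizer;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical illustration; not part of Xerces.
final class WellFormedVsNormalizedSketch {

    // Parses the document (which enforces well-formedness) and returns
    // the text content of the root element.
    static String parseText(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("not well-formed or parser unavailable", e);
        }
    }

    // Stand-in normalization check, applied after parsing.
    static boolean isNfc(String text) {
        return Normalizer.isNormalized(text, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // 'e' followed by U+0301 (combining acute accent): both are legal
        // XML characters, but the sequence is not NFC-normalized.
        String text = parseText("<p>e\u0301</p>");
        System.out.println("parsed length: " + text.length());
        System.out.println("is NFC: " + isNfc(text));
        String fixed = Normalizer.normalize(text, Normalizer.Form.NFC);
        System.out.println("is NFC after normalizing: " + isNfc(fixed));
    }
}
```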
> > Thanks,
> > Richard
> >
> > [1] http://www.w3.org/TR/xml11/#charencoding
> >
