Re: GSOC 2009 - Unicode Normalization Project

Michael Glavassevich Tue, 31 Mar 2009 06:58:39 -0700

Hi Richard,

Richard Kelly <[email protected]> wrote on 03/31/2009 02:44:03 AM:


> Hi Michael,
>
> Thanks for your thorough review, I'll revise my proposal ASAP using
> your feedback.
>
>
> 2009/3/30 Michael Glavassevich <[email protected]>:
> >
> > All XML characters are Unicode. If you were thinking of other character
> > encodings besides UTF-*, these all get converted to Java chars on input,
> > so essentially Xerces is always working on UTF-16 and thus the
> > normalization checker / normalizer will always see a "Unicode encoding
> > form".
> >
>
> Ok, dealing with a single encoding should make it easier. I think I got
> mixed up when reading section 4.3.3 of the XML spec, which mentions some
> other encodings. [1]
>
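
To make the single-encoding point concrete, here is a minimal sketch using only JDK classes (it is not Xerces code, and the class and method names are made up for the illustration). Once the bytes of a document have been decoded, the same character is the same Java char whatever the source encoding was, so a normalization check only ever has to deal with one form:

```java
import java.nio.charset.Charset;
import java.text.Normalizer;

// Hypothetical illustration class; not part of Xerces.
final class DecodedInputSketch {

    // Decode raw bytes into a Java String (UTF-16 internally).
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute) encoded two different ways.
        byte[] latin1 = { (byte) 0xE9 };
        byte[] utf8 = { (byte) 0xC3, (byte) 0xA9 };

        String a = decode(latin1, "ISO-8859-1");
        String b = decode(utf8, "UTF-8");

        // After decoding, both inputs are the same sequence of Java chars.
        System.out.println(a.equals(b));

        // A normalization check therefore runs on the decoded chars only.
        System.out.println(Normalizer.isNormalized(a, Normalizer.Form.NFC));
    }
}
```

java.text.Normalizer is used here only as a convenient stand-in for whatever the normalization checker ends up doing internally.
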
> >
> > Probably something you've already realized but worth clarifying... The
> > pipeline (XMLParserConfiguration [1]) is shared between the SAX and DOM
> > (and perhaps one day StAX) parsers, so these features apply equally to
> > the existing SAX XMLReader, JAXP SAXParser and DocumentBuilder. There's
> > already a standard SAX feature defined for normalization checking [2].
> > We should probably define a Xerces-specific feature URI to cover the
> > normalization function which could be set on the SAX parser, similar to
> > the parameter defined in DOM Level 3 Core / Load & Save. For a DOM in
> > memory, the normalizing / normalization checking functions would be
> > invoked by setting the parameters on the DOMConfiguration and calling
> > normalizeDocument(). In addition to plugging in the XNI component here,
> > it would also involve updating the DOM with the normalized text. And
> > when a DOM is loaded with an LSParser, if the LSInput.certifiedText [3]
> > flag is true, I believe the intention is that normalization processing
> > is skipped, so we should have some way to bypass the normalization
> > component (e.g. excluding it from the pipeline) when the input claims to
> > be certified.
> >
>
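
For reference, the user-facing side of what I described could look roughly like the sketch below. It uses only the standard SAX and DOM Level 3 APIs: the SAX feature URI and the two DOMConfiguration parameter names are the standard ones, while the class and helper names are made up for the example. No Xerces-specific feature URI appears because none has been defined yet. Whether each setting is accepted depends on the parser in use, so the helpers just report true or false rather than assuming support:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.DOMConfiguration;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

// Hypothetical illustration class; not part of Xerces.
final class NormalizationFeatureSketch {

    // Standard SAX2 feature for normalization checking.
    static final String SAX_CHECKING =
        "http://xml.org/sax/features/unicode-normalization-checking";

    static XMLReader newReader() {
        try {
            return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    static Document newDocumentWithRoot(String rootName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
            doc.appendChild(doc.createElement(rootName));
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns false if the parser does not recognize or support the feature.
    static boolean trySetSaxChecking(XMLReader reader) {
        try {
            reader.setFeature(SAX_CHECKING, true);
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    // Returns false if the DOM implementation cannot set the parameter to true.
    static boolean trySetDomParameter(Document doc, String name) {
        DOMConfiguration config = doc.getDomConfig();
        if (!config.canSetParameter(name, Boolean.TRUE)) {
            return false;
        }
        config.setParameter(name, Boolean.TRUE);
        return true;
    }

    public static void main(String[] args) {
        System.out.println("SAX checking accepted: "
            + trySetSaxChecking(newReader()));

        Document doc = newDocumentWithRoot("root");
        System.out.println("check-character-normalization accepted: "
            + trySetDomParameter(doc, "check-character-normalization"));
        System.out.println("normalize-characters accepted: "
            + trySetDomParameter(doc, "normalize-characters"));

        // For a DOM in memory, the configured processing runs here.
        doc.normalizeDocument();
    }
}
```
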
> I had one question about another class called XML11Char which lets you
> check if a character is a valid XML 1.1 character. Should normalization
> checks play any role in this validation?

XML11Char and its counterpart XMLChar are used for checking
well-formedness: the set of rules which all XML documents must conform to
(otherwise they're not XML). Well-formedness checking will have logically
occurred before the normalization checker / normalizer sees the data. I
wouldn't expect that you would need to call any of these methods again in
that context.
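
A small sketch of why the two checks are independent, using only JDK classes. The isXml11Char helper below is a hand-written stand-in for the XML 1.1 Char production, not the actual XML11Char code, and java.text.Normalizer stands in for the normalization checker; all names are made up for the example. Text can consist entirely of legal XML 1.1 characters and still not be normalized:

```java
import java.text.Normalizer;

// Hypothetical illustration class; not part of Xerces.
final class CharCheckVsNormalizationSketch {

    // XML 1.1 Char production:
    // [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isXml11Char(int cp) {
        return (cp >= 0x1 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Character-level check: every code point must match Char.
    static boolean allXml11Chars(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!isXml11Char(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    // Normalization check on text that already passed the character check.
    static boolean isNfc(String s) {
        return Normalizer.isNormalized(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String decomposed = "e\u0301"; // 'e' followed by COMBINING ACUTE ACCENT
        String composed = "\u00E9";    // precomposed e with acute

        // Both strings contain only legal XML 1.1 characters...
        System.out.println(allXml11Chars(decomposed) + " " + allXml11Chars(composed));
        // ...but only the precomposed one is in NFC.
        System.out.println(isNfc(decomposed) + " " + isNfc(composed));
    }
}
```

So a character that passes the first check tells you nothing about the second, and the normalization component can assume the first has already been done.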

> Thanks,
> Richard
>
> [1] http://www.w3.org/TR/xml11/#charencoding

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: [email protected]
