[
https://issues.apache.org/jira/browse/XERCESJ-1383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12730116#action_12730116
]
Michael Glavassevich commented on XERCESJ-1383:
-----------------------------------------------
Hi Richard, as I mentioned on the mailing list, what you have so far is looking
good. I do have a few suggestions:
For performance, I think that if a piece of text has already been determined to
be normalized by the normalization checker, you could pass the original string
through without calling normalize. For example:
    public void comment(XMLString text, Augmentations augs) throws XNIException {
        boolean normalized = false;
        if (fCheckCharacters) {
            normalized = checkNormalized(text,0);
        }
        if (fDocumentHandler != null) {
            if (fCharacterNormalization && !normalized) {
                fDocumentHandler.comment(normalize(text),augs);
            }
            else {
                fDocumentHandler.comment(text,augs);
            }
        }
    } // comment(XMLString,Augmentations)
I wonder if the new error message you added ("The XML characters are not fully
normalized.") could contain some context about the error (e.g. the sequence of
text which isn't normalized). That would help the user understand which
portion(s) of the document they need to repair to make it normalized.
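
A minimal sketch of how such context might be gathered, under stated
assumptions: it uses java.text.Normalizer (Java 6+) and NFC as a stand-in for
the "fully normalized" check, which is not the same definition as in XML 1.1,
and it is not the checker from the patch. The class name, method names and
extended message wording are hypothetical.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of Xerces: locates the first place where the
// text differs from its NFC form and builds an error message around it.
final class NormalizationErrorContext {

    /** Index of the first char where text and its NFC form differ, or -1 if text is already NFC. */
    static int firstDifference(String text) {
        String nfc = Normalizer.normalize(text, Normalizer.Form.NFC);
        if (nfc.equals(text)) {
            return -1;
        }
        int limit = Math.min(text.length(), nfc.length());
        for (int i = 0; i < limit; i++) {
            if (text.charAt(i) != nfc.charAt(i)) {
                return i;
            }
        }
        // One string is a proper prefix of the other.
        return limit;
    }

    /** Error message with up to 'radius' chars of context on each side, or null if text is NFC. */
    static String describe(String text, int radius) {
        int at = firstDifference(text);
        if (at < 0) {
            return null;
        }
        int start = Math.max(0, at - radius);
        int end = Math.min(text.length(), at + radius + 1);
        // Assumed message format; only the first sentence comes from the patch.
        return "The XML characters are not fully normalized. Near offset " + at
                + ": \"" + text.substring(start, end) + "\"";
    }
}
```

The substring is cut on char boundaries, so a real implementation would
probably want to avoid splitting surrogate pairs and to report line and column
through the locator rather than a raw offset.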
Xerces guarantees that the String values in QNames and several other constructs
have been internalized [1] (i.e. String.intern()). Applications rely on this,
as do Xerces' internals, which in many places compare references with '=='
instead of calling .equals() for performance reasons. We need to make sure that
when we normalize one of those constructs, the String value that we pass down
the pipeline has been internalized. This can be accomplished using the
SymbolTable [2].
[1] http://xerces.apache.org/xerces2-j/features.html#string-interning
[2]
http://xerces.apache.org/xerces2-j/javadocs/xerces2/org/apache/xerces/util/SymbolTable.html
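
To illustrate why the normalized value has to be internalized again, here is a
small sketch under stated assumptions. It uses only the JDK
(java.text.Normalizer with NFC, and String.intern()) so that it runs without
Xerces on the classpath. Inside Xerces the equivalent step would presumably go
through the component's SymbolTable, e.g. addSymbol(String), rather than
calling intern() directly. The class and method names are hypothetical.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of Xerces: normalizes a name and returns the
// canonical (interned) instance so that '==' comparisons keep working.
final class InternedNormalization {

    static String normalizeAndIntern(String name) {
        if (Normalizer.isNormalized(name, Normalizer.Form.NFC)) {
            // Nothing to rewrite; still return the canonical instance.
            return name.intern();
        }
        // normalize() creates a fresh String, which is NOT interned by itself.
        return Normalizer.normalize(name, Normalizer.Form.NFC).intern();
    }
}
```

Without the intern() call (or the SymbolTable lookup in Xerces), two names that
are equal after normalization would generally be distinct objects, and any
downstream code comparing them with '==' would treat them as different.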
> Adding Unicode Normalization support to Xerces2-J
> --------------------------------------------------
>
> Key: XERCESJ-1383
> URL: https://issues.apache.org/jira/browse/XERCESJ-1383
> Project: Xerces2-J
> Issue Type: New Feature
> Components: DOM (Level 3 Core), SAX
> Affects Versions: 2.9.1
> Environment: All
> Reporter: Richard Kelly
> Assignee: Michael Glavassevich
> Attachments: CharacterNormalizer.java, CharacterNormalizer.patch
>
>
> This feature will add support for Unicode character normalization and
> normalization checking to Xerces. Applications that use Xerces will be able
> to produce fully normalized XML documents and verify that any XML documents
> they process are fully normalized.
> Adding this functionality will allow Xerces to meet the XML 1.1 W3C
> Recommendation regarding character normalization and allow it to implement
> the optional character normalization and normalization checking features
> specified in the DOM Level 3 Core and SAX2.
> More specifically, the features to be implemented are:
> DOM Level 3 Core: "normalize-characters" [1]
> DOM Level 3 Core: "check-character-normalization" [2]
> SAX2: "unicode-normalization-checking" [3]
> [1]
> http://www.w3.org/TR/DOM-Level-3-Core/core.html#parameter-normalize-characters
> [2]
> http://www.w3.org/TR/DOM-Level-3-Core/core.html#parameter-check-character-normalization
> [3] http://www.saxproject.org/apidoc/org/xml/sax/package-summary.html