[jira] Commented: (XERCESJ-1383) Adding Unicode Normalization support to Xerces2-J

Michael Glavassevich (JIRA) Sun, 12 Jul 2009 09:27:45 -0700

    [ https://issues.apache.org/jira/browse/XERCESJ-1383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12730116#action_12730116 ]


Michael Glavassevich commented on XERCESJ-1383:
-----------------------------------------------

Hi Richard, as I mentioned on the mailing list, what you have so far is looking 
good. I do have a few suggestions:

For performance, if the normalization checker has already determined that a 
piece of text is normalized, you could pass the original string through without 
calling normalize. For example:

    public void comment(XMLString text, Augmentations augs) throws XNIException {
        // Stays false unless the checker runs and confirms the text is normalized.
        boolean normalized = false;
        if (fCheckCharacters) {
            normalized = checkNormalized(text, 0);
        }
        if (fDocumentHandler != null) {
            // Skip the call to normalize() when the text is already known to be normalized.
            if (fCharacterNormalization && !normalized) {
                fDocumentHandler.comment(normalize(text), augs);
            }
            else {
                fDocumentHandler.comment(text, augs);
            }
        }
    } // comment(XMLString,Augmentations)

I wonder if the new error message you added ("The XML characters are not fully 
normalized.") could include some context about the error (e.g. the sequence of 
text which isn't normalized), so that users can tell which portion(s) of the 
document they need to repair to make it normalized.
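As a rough illustration of the kind of context such a message could carry, here 
is a minimal sketch. It assumes java.text.Normalizer (Java 6+) and NFC as a 
stand-in for the patch's own checkNormalized/normalize logic, and the class and 
method names (NormalizationErrorContext, firstDifference, describe) are 
hypothetical, not part of the patch. It also reports UTF-16 code units rather 
than full code points, which a real implementation might want to refine.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of Xerces or the attached patch.
class NormalizationErrorContext {

    // Returns the index of the first UTF-16 code unit at which the text
    // diverges from its NFC form, or -1 if the text is already in NFC.
    static int firstDifference(String text) {
        if (Normalizer.isNormalized(text, Normalizer.Form.NFC)) {
            return -1;
        }
        String nfc = Normalizer.normalize(text, Normalizer.Form.NFC);
        int limit = Math.min(text.length(), nfc.length());
        for (int i = 0; i < limit; i++) {
            if (text.charAt(i) != nfc.charAt(i)) {
                return i;
            }
        }
        // One string is a prefix of the other; report where the shorter ends.
        return limit;
    }

    // Builds an error message listing the code units within `radius`
    // positions of the first divergence, or returns null if the text is NFC.
    static String describe(String text, int radius) {
        int pos = firstDifference(text);
        if (pos < 0) {
            return null;
        }
        int start = Math.max(0, pos - radius);
        int end = Math.min(text.length(), pos + radius + 1);
        StringBuilder sb = new StringBuilder(
                "The XML characters are not fully normalized (near offset ");
        sb.append(pos).append(':');
        for (int i = start; i < end; i++) {
            sb.append(String.format(" U+%04X", (int) text.charAt(i)));
        }
        return sb.append(").").toString();
    }
}
```

For the input "cafe\u0301" (an "e" followed by a combining acute accent), 
firstDifference returns 3, so the message points at the "e" plus combining mark 
that would need to be replaced by the precomposed character.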

Xerces guarantees that the String values in QNames and several other constructs 
have been internalized [1] (i.e. String.intern()). Both applications and Xerces' 
internals rely on this: for performance reasons, many places in Xerces compare 
these Strings by reference with '==' instead of .equals(). When we normalize one 
of those constructs, we need to make sure that the String value we pass down the 
pipeline has been internalized. This can be accomplished using the 
SymbolTable [2].

[1] http://xerces.apache.org/xerces2-j/features.html#string-interning
[2] http://xerces.apache.org/xerces2-j/javadocs/xerces2/org/apache/xerces/util/SymbolTable.html
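
To make the interning concern concrete, here is a small sketch of normalizing a 
name and then returning a canonical instance, so that reference comparison with 
'==' keeps working. It assumes java.text.Normalizer with NFC in place of the 
patch's normalize method, and it uses String.intern() only so that the example 
runs without Xerces on the classpath. Inside the parser the equivalent step 
would be a call to SymbolTable.addSymbol(String) on the component's symbol 
table. The class and method names are hypothetical.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of Xerces or the attached patch.
class InternAfterNormalize {

    // Normalizes a name to NFC (only when needed) and returns the canonical
    // instance, so callers can safely compare the result with '=='.
    static String normalizeAndIntern(String name) {
        String nfc = Normalizer.isNormalized(name, Normalizer.Form.NFC)
                ? name
                : Normalizer.normalize(name, Normalizer.Form.NFC);
        // Stand-in for symbolTable.addSymbol(nfc) inside Xerces.
        return nfc.intern();
    }
}
```

With this in place, a decomposed name such as "e\u0301l" and its precomposed 
form "\u00e9l" both map to the same String instance, which is what downstream 
code comparing names by reference expects.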

> Adding Unicode Normalization support to Xerces2-J 
> --------------------------------------------------
>
>                 Key: XERCESJ-1383
>                 URL: https://issues.apache.org/jira/browse/XERCESJ-1383
>             Project: Xerces2-J
>          Issue Type: New Feature
>          Components: DOM (Level 3 Core), SAX
>    Affects Versions: 2.9.1
>         Environment: All
>            Reporter: Richard Kelly
>            Assignee: Michael Glavassevich
>         Attachments: CharacterNormalizer.java, CharacterNormalizer.patch
>
>
> This feature will add support for Unicode character normalization and 
> normalization checking to Xerces.  Applications that use Xerces will be able 
> to produce fully normalized XML documents and verify that any XML documents 
> they process are fully normalized. 
> Adding this functionality will allow Xerces to meet the XML 1.1 W3C 
> Recommendation regarding character normalization and allow it to implement 
> the optional character normalization and normalization checking features 
> specified in the DOM Level 3 Core and SAX2.
> More specifically, the features to be implemented are:
> DOM Level 3 Core: "normalize-characters" [1]
> DOM Level 3 Core: "check-character-normalization" [2]
> SAX2: "unicode-normalization-checking" [3]
> [1] 
> http://www.w3.org/TR/DOM-Level-3-Core/core.html#parameter-normalize-characters
> [2] 
> http://www.w3.org/TR/DOM-Level-3-Core/core.html#parameter-check-character-normalization
> [3] http://www.saxproject.org/apidoc/org/xml/sax/package-summary.html

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

