[ 
https://issues.apache.org/jira/browse/LANG-66?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12607737#action_12607737
 ] 

weaver edited comment on LANG-66 at 6/24/08 12:37 PM:
--------------------------------------------------------------

The correct implementation for this should be:

1.  Escape all known Unicode values (already being done)
2.  Remove or mask all values OUTSIDE the following allowed sets:
    Allowed Whitespace: 0x9  0xA  0xD  0x20
    Range 1: 0x21 - 0xD7FF
    Range 2: 0xE000 - 0xFFFD
    Range 3: 0x10000 - 0x10FFFF
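
As a rough illustration of the allowed sets listed above, a membership check
could look like the sketch below. The class name XmlChars and the method name
isAllowed are made up for this example and are not part of Commons Lang; the
check works on code points (int) so that Range 3 is reachable.

```java
// Hypothetical helper, not an existing Commons Lang class.
final class XmlChars {
    private XmlChars() {}

    /** Returns true if the code point is in one of the allowed sets from the list above. */
    static boolean isAllowed(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD || cp == 0x20   // allowed whitespace
                || (cp >= 0x21 && cp <= 0xD7FF)                    // Range 1
                || (cp >= 0xE000 && cp <= 0xFFFD)                  // Range 2
                || (cp >= 0x10000 && cp <= 0x10FFFF);              // Range 3
    }
}
```

With this sketch, XmlChars.isAllowed(0x13) is false, while XmlChars.isAllowed(0x20)
and XmlChars.isAllowed(0x10000) are true.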

Anything outside the above values that hasn't already been escaped should 
be masked or removed.  What I do is write the hex value in place of the actual 
character.

For example, the evil 0x13 that gets copied out of MS Word all the friggin time 
would look something like this:

[Unicode: 0x13]
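
A minimal sketch of that masking step, assuming it runs on text whose known
values have already been escaped. XmlCharMasker, isAllowed and mask are
hypothetical names for this example, not existing Commons Lang API, and the
exact "[Unicode: 0x..]" layout simply follows the sample above.

```java
import java.util.Locale;

// Hypothetical helper, not an existing Commons Lang class.
final class XmlCharMasker {
    private XmlCharMasker() {}

    /** True if the code point is in the allowed whitespace set or one of the three ranges. */
    static boolean isAllowed(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD || cp == 0x20
                || (cp >= 0x21 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Replaces every disallowed code point with a visible "[Unicode: 0x..]" marker. */
    static String mask(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            // codePointAt joins valid surrogate pairs; a lone surrogate comes back
            // as its own value (0xD800-0xDFFF), which is not allowed and gets masked.
            int cp = input.codePointAt(i);
            if (isAllowed(cp)) {
                out.appendCodePoint(cp);
            } else {
                out.append("[Unicode: 0x")
                   .append(Integer.toHexString(cp).toUpperCase(Locale.ROOT))
                   .append(']');
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

With this sketch, XmlCharMasker.mask("a\u0013b") returns "a[Unicode: 0x13]b",
and text made up only of allowed characters comes back unchanged.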

I feel this is better than removing the character completely or replacing it 
with a generic "?" or something like that, because the bad data can be tracked 
down and debugged much more quickly.

Reference: XML Specification, section 2.2 http://www.w3.org/TR/REC-xml/#charsets

      was (Author: weaver):
    The correct implementation for this should be:

1.  Escape all known Unicode values (already being done)
2.  Remove or mask all values OUTSIDE the following allowed sets:
    Allowed Whitespace: 0x9  0xA  0xD  
    Range 1: 0x21 - 0xD7FF
    Range 2: 0xE000 - 0xFFFD
    Range 3: 0x10000 - 0x10FFFF

Anything outside the above values that hasn't already been escaped should 
be masked or removed.  What I do is write the hex value in place of the actual 
character.

For example, the evil 0x13 that gets copied out of MS Word all the friggin time 
would look something like this:

[Unicode: 0x13]

I feel this is better than removing the character completely or replacing it 
with a generic "?" or something like that, because the bad data can be tracked 
down and debugged much more quickly.

Reference: XML Specification, section 2.2 http://www.w3.org/TR/REC-xml/#charsets
  
> [lang] StringEscaper.escapeXml() escapes characters > 0x7f
> ----------------------------------------------------------
>
>                 Key: LANG-66
>                 URL: https://issues.apache.org/jira/browse/LANG-66
>             Project: Commons Lang
>          Issue Type: Bug
>    Affects Versions: 2.1
>         Environment: Operating System: All
> Platform: All
>            Reporter: Sandor Vroemisse
>             Fix For: 3.0
>
>
> StringEscaper.escapeXml() escapes characters > 0x7f. That's both undesired and
> undocumented.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
