[ 
https://issues.apache.org/jira/browse/LUCENE-3690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13190908#comment-13190908
 ] 

Steven Rowe edited comment on LUCENE-3690 at 1/23/12 7:31 AM:
--------------------------------------------------------------

Patch (excluding the regenerated .java scanner) that fixes the failures 
uncovered by random testing on unpaired surrogate numeric character entities: 
each unpaired surrogate is now output as REPLACEMENT CHARACTER U+FFFD.  The 
patch also adds the ability to interpret a properly paired set of UTF-16 
surrogate entities as a single above-BMP code point.  Added tests covering 
all four combinations of hex & decimal numeric character entities in 
surrogate pairs.
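
For readers who want to see the behavior in isolation, here is a minimal 
sketch of the surrogate handling described above.  It is not the JFlex 
grammar from the patch: the class name {{NumericEntityDecoder}}, the regex, 
the digit limits, and the choice to map values above U+10FFFF to U+FFFD are 
all assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the patch: decodes numeric character
// entities and applies the surrogate rules described in the comment.
final class NumericEntityDecoder {
    // Hex (&#x...;) or decimal (&#...;) entity; the digit limits are an
    // assumption that keeps Integer.parseInt from overflowing.
    private static final Pattern ENTITY =
        Pattern.compile("&#(?:[xX]([0-9A-Fa-f]{1,6})|([0-9]{1,7}));");
    private static final char REPLACEMENT = 0xFFFD;

    static String decode(String input) {
        StringBuilder out = new StringBuilder();
        Matcher m = ENTITY.matcher(input);
        int pos = 0;           // end of the previous entity
        int pendingHigh = -1;  // high surrogate waiting for a partner, or -1
        while (m.find()) {
            int value = m.group(1) != null
                ? Integer.parseInt(m.group(1), 16)
                : Integer.parseInt(m.group(2));
            if (pendingHigh >= 0) {
                // A pair only counts if the low surrogate follows immediately.
                boolean paired = m.start() == pos && isLow(value);
                if (paired) {
                    out.appendCodePoint(
                        Character.toCodePoint((char) pendingHigh, (char) value));
                } else {
                    out.append(REPLACEMENT);  // the high surrogate was unpaired
                }
                pendingHigh = -1;
                if (paired) {
                    pos = m.end();
                    continue;
                }
            }
            out.append(input, pos, m.start());  // literal text between entities
            if (isHigh(value)) {
                pendingHigh = value;            // decide once we see what follows
            } else if (isLow(value) || value > 0x10FFFF) {
                out.append(REPLACEMENT);        // unpaired low surrogate or out of range
            } else {
                out.appendCodePoint(value);
            }
            pos = m.end();
        }
        if (pendingHigh >= 0) {
            out.append(REPLACEMENT);            // input ended after a high surrogate
        }
        out.append(input, pos, input.length());
        return out.toString();
    }

    private static boolean isHigh(int v) { return v >= 0xD800 && v <= 0xDBFF; }
    private static boolean isLow(int v)  { return v >= 0xDC00 && v <= 0xDFFF; }
}
```

With this sketch, the four pairings of {{&#xD83D;}} or {{&#55357;}} with 
{{&#xDE00;}} or {{&#56832;}} all decode to the single code point U+1F600, 
while a surrogate entity on its own decodes to U+FFFD.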

Also added {{@SuppressWarnings("fallthrough")}} to the JFlex-generated scanner 
class, so that the 40+ warnings about switch case fall-throughs don't clutter 
the output.
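
As a small illustration of what the annotation does (a sketch only, using a 
hypothetical class rather than the generated scanner): when compiling with 
{{javac -Xlint:fallthrough}}, each case that falls into the next one 
produces a warning, and a class-level {{@SuppressWarnings("fallthrough")}} 
silences them for the whole class.

```java
// Hypothetical example class; the real target is the JFlex-generated scanner.
@SuppressWarnings("fallthrough")
final class FallthroughDemo {
    // Counts how many case bodies run when the switch is entered at 'state'.
    static int actionsFrom(int state) {
        int count = 0;
        switch (state) {
            case 0:
                count++;  // deliberately falls through to case 1
            case 1:
                count++;  // deliberately falls through to case 2
            case 2:
                count++;
                break;
            default:
                break;
        }
        return count;
    }
}
```

Placing the annotation on the class keeps the generated code untouched 
otherwise, at the cost of also hiding any accidental fall-through in that 
class.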

*Edit*: committing to trunk shortly.
                
> JFlex-based HTMLStripCharFilter replacement
> -------------------------------------------
>
>                 Key: LUCENE-3690
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3690
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/analysis
>    Affects Versions: 3.5, 4.0
>            Reporter: Steven Rowe
>            Assignee: Steven Rowe
>             Fix For: 3.6, 4.0
>
>         Attachments: BaselineWarcTest.java, HTMLStripCharFilterWarcTest.java, 
> JFlexHTMLStripCharFilterWarcTest.java, 
> LUCENE-3690-handle-utf16-surrogates.patch, LUCENE-3690.patch, 
> LUCENE-3690.patch, LUCENE-3690.patch, LUCENE-3690.patch, LUCENE-3690.patch, 
> jenkins_test.patch
>
>
> A JFlex-based HTMLStripCharFilter replacement would be more performant and 
> easier to understand and maintain.

        
