[ https://issues.apache.org/jira/browse/LUCENE-3690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13190908#comment-13190908 ]
Steven Rowe edited comment on LUCENE-3690 at 1/23/12 7:31 AM:
--------------------------------------------------------------
Patch (excluding the re-generated .java scanner) that addresses the unpaired
surrogate numeric character entity failures uncovered by random testing, by
outputting REPLACEMENT CHARACTER U+FFFD, and adds the ability to interpret
properly paired UTF-16 surrogates as an above-BMP codepoint. Added tests to
cover all four combinations of hex & decimal surrogate numeric character
entities in surrogate pairs.
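
For illustration only, the surrogate handling described above might look roughly like the following in plain Java. This is a sketch, not the JFlex grammar from the patch: the class name {{SurrogateEntityDemo}}, the {{decode}} method, and the regex are hypothetical, and it assumes that a high-surrogate entity only pairs with a low-surrogate entity that follows it immediately.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the LUCENE-3690 patch.
final class SurrogateEntityDemo {
    // One hex (&#xD83D;) or decimal (&#55357;) numeric character entity.
    // Digit counts are capped so Integer.parseInt cannot overflow; longer
    // digit runs do not match and are copied through unchanged.
    private static final Pattern ENTITY =
        Pattern.compile("&#(?:[xX]([0-9A-Fa-f]{1,6})|([0-9]{1,7}));");
    private static final char REPLACEMENT = '\uFFFD';

    public static String decode(String input) {
        StringBuilder out = new StringBuilder();
        Matcher m = ENTITY.matcher(input);
        int last = 0;
        int pendingHigh = -1; // high surrogate awaiting its partner, or -1
        while (m.find()) {
            if (m.start() > last) {
                // Literal text intervenes, so a pending high surrogate is unpaired.
                if (pendingHigh >= 0) { out.append(REPLACEMENT); pendingHigh = -1; }
                out.append(input, last, m.start());
            }
            int value = m.group(1) != null
                ? Integer.parseInt(m.group(1), 16)
                : Integer.parseInt(m.group(2), 10);
            if (value >= 0xD800 && value <= 0xDBFF) {          // high surrogate
                if (pendingHigh >= 0) out.append(REPLACEMENT); // previous one was unpaired
                pendingHigh = value;
            } else if (value >= 0xDC00 && value <= 0xDFFF) {   // low surrogate
                if (pendingHigh >= 0) {
                    // Properly paired: emit the above-BMP code point.
                    out.appendCodePoint(Character.toCodePoint((char) pendingHigh, (char) value));
                    pendingHigh = -1;
                } else {
                    out.append(REPLACEMENT);                   // unpaired low surrogate
                }
            } else {
                if (pendingHigh >= 0) { out.append(REPLACEMENT); pendingHigh = -1; }
                if (Character.isValidCodePoint(value)) out.appendCodePoint(value);
                else out.append(REPLACEMENT);
            }
            last = m.end();
        }
        if (pendingHigh >= 0) out.append(REPLACEMENT);         // trailing unpaired high surrogate
        out.append(input, last, input.length());
        return out.toString();
    }
}
```

The four hex/decimal pairings of the entities for U+1F600 (high surrogate 0xD83D = 55357, low surrogate 0xDE00 = 56832) all go through the same paired branch, while a lone surrogate entity in either notation yields U+FFFD.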
Also added {{@SuppressWarnings("fallthrough")}} to the JFlex-generated scanner
class, so that the 40+ warnings about switch case fall-throughs don't clutter
the output.
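
As a minimal sketch of what the annotation does (the class below is a hypothetical stand-in, not the generated scanner): compiling with javac's {{-Xlint:fallthrough}} would normally warn about each case that falls into the next, and a class-level {{@SuppressWarnings("fallthrough")}} silences those warnings for the whole class.

```java
// Hypothetical stand-in for a generated scanner class.
@SuppressWarnings("fallthrough")
final class FallthroughDemo {
    // Each case deliberately falls through to the one below it.
    public static int weight(int state) {
        int total = 0;
        switch (state) {
            case 2:
                total += 2;
                // falls through to case 1
            case 1:
                total += 1;
                // falls through to default
            default:
                total += 10;
        }
        return total;
    }
}
```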
*Edit*: committing to trunk shortly.
> JFlex-based HTMLStripCharFilter replacement
> -------------------------------------------
>
> Key: LUCENE-3690
> URL: https://issues.apache.org/jira/browse/LUCENE-3690
> Project: Lucene - Java
> Issue Type: New Feature
> Components: modules/analysis
> Affects Versions: 3.5, 4.0
> Reporter: Steven Rowe
> Assignee: Steven Rowe
> Fix For: 3.6, 4.0
>
> Attachments: BaselineWarcTest.java, HTMLStripCharFilterWarcTest.java,
> JFlexHTMLStripCharFilterWarcTest.java,
> LUCENE-3690-handle-utf16-surrogates.patch, LUCENE-3690.patch,
> LUCENE-3690.patch, LUCENE-3690.patch, LUCENE-3690.patch, LUCENE-3690.patch,
> jenkins_test.patch
>
>
> A JFlex-based HTMLStripCharFilter replacement would be more performant and
> easier to understand and maintain.