[ https://issues.apache.org/jira/browse/LUCENE-3690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Steven Rowe updated LUCENE-3690:
--------------------------------
Attachment: LUCENE-3690-handle-utf16-surrogates.patch
This patch (which excludes the regenerated .java scanner) fixes the failures
on unpaired surrogate numeric character entities uncovered by random testing:
each unpaired surrogate is now output as U+FFFD REPLACEMENT CHARACTER. It also
adds the ability to interpret a properly paired UTF-16 surrogate sequence as a
single above-BMP code point. Added tests cover all four combinations of hex and
decimal numeric character entities in surrogate pairs.
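
For readers who want to see the intended behavior outside the scanner, here is
a minimal standalone sketch. It is not the code in the patch (the real logic
lives in the JFlex grammar); the class name SurrogateEntitySketch, the decode()
method, and the regex are all hypothetical and exist only for illustration. It
assumes a low surrogate must immediately follow a high surrogate to form a
pair, and it leaves entities with more than 6 hex or 7 decimal digits untouched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the JFlex-generated scanner.
final class SurrogateEntitySketch {
    // Matches &#xHHHH; (hex, group 1) or &#DDDD; (decimal, group 2).
    private static final Pattern ENTITY =
        Pattern.compile("&#(?:[xX]([0-9A-Fa-f]{1,6})|([0-9]{1,7}));");

    private static int value(Matcher m) {
        return m.group(1) != null
            ? Integer.parseInt(m.group(1), 16)
            : Integer.parseInt(m.group(2));
    }

    private static boolean isHigh(int v) { return v >= 0xD800 && v <= 0xDBFF; }
    private static boolean isLow(int v)  { return v >= 0xDC00 && v <= 0xDFFF; }

    static String decode(String in) {
        StringBuilder out = new StringBuilder();
        Matcher m = ENTITY.matcher(in);
        int pos = 0;
        while (m.find(pos)) {
            out.append(in, pos, m.start()); // copy text before the entity
            int v = value(m);
            pos = m.end();
            if (isHigh(v)) {
                // Pair only with a low surrogate entity that follows directly.
                Matcher next = ENTITY.matcher(in).region(pos, in.length());
                if (next.lookingAt() && isLow(value(next))) {
                    out.appendCodePoint(
                        Character.toCodePoint((char) v, (char) value(next)));
                    pos = next.end();
                } else {
                    out.append('\uFFFD'); // unpaired high surrogate
                }
            } else if (isLow(v) || !Character.isValidCodePoint(v)) {
                out.append('\uFFFD'); // unpaired low surrogate or out of range
            } else {
                out.appendCodePoint(v);
            }
        }
        out.append(in, pos, in.length()); // copy the remaining text
        return out.toString();
    }
}
```

Under these assumptions, decode("&#xD83D;&#56832;") yields the single code
point U+1F600, while decode("a&#xD83D;b") yields "a", U+FFFD, "b".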
Also added {{@SuppressWarnings("fallthrough")}} to the JFlex-generated scanner
class, so that the 40+ warnings about switch case fall-throughs don't clutter
the output.
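
As a small illustration of what the annotation suppresses, the sketch below
shows a class-level @SuppressWarnings("fallthrough") on a switch with
deliberate fall-through, the pattern javac reports under -Xlint:fallthrough.
FallthroughSketch and weight() are hypothetical names, not part of the
generated scanner.

```java
// Hypothetical illustration only; the real annotation sits on the
// JFlex-generated scanner class.
@SuppressWarnings("fallthrough")
final class FallthroughSketch {
    static int weight(int state) {
        int total = 0;
        switch (state) {
            case 2: total += 100; // falls through on purpose
            case 1: total += 10;  // falls through on purpose
            case 0: total += 1;
                    break;
            default: total = -1;
        }
        return total;
    }
}
```

Without the annotation, compiling with -Xlint:fallthrough would warn on the
case 1 and case 0 labels, since each can be entered by falling through from
the case above it.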
> JFlex-based HTMLStripCharFilter replacement
> -------------------------------------------
>
> Key: LUCENE-3690
> URL: https://issues.apache.org/jira/browse/LUCENE-3690
> Project: Lucene - Java
> Issue Type: New Feature
> Components: modules/analysis
> Affects Versions: 3.5, 4.0
> Reporter: Steven Rowe
> Assignee: Steven Rowe
> Fix For: 3.6, 4.0
>
> Attachments: BaselineWarcTest.java, HTMLStripCharFilterWarcTest.java,
> JFlexHTMLStripCharFilterWarcTest.java,
> LUCENE-3690-handle-utf16-surrogates.patch, LUCENE-3690.patch,
> LUCENE-3690.patch, LUCENE-3690.patch, LUCENE-3690.patch, LUCENE-3690.patch,
> jenkins_test.patch
>
>
> A JFlex-based HTMLStripCharFilter replacement would be more performant and
> easier to understand and maintain.