[
https://issues.apache.org/jira/browse/LUCENE-3690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185810#comment-13185810
]
Hoss Man commented on LUCENE-3690:
----------------------------------
bq. +1. I'll do this before committing anything.
i wouldn't be shy about committing the new impl + tests, i would just wait to
change the solr factory default behavior until we prove the perf is as good as
the existing one in some common cases, and if it's not, then re-evaluate the
names of the classes.
and by common cases i'm thinking...
* some test docs using typical well-formed HTML markup
* some test docs using malformed markup that require backtracking
* some test docs that contain almost no HTML at all (this is the one that i
have a hunch may be a big differentiator -- i've seen lots of people who use
the HTML stripper not because they expect HTML, but because they want to be
sure it doesn't get indexed if some stray HTML encoding sneaks into their data)
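
A rough sketch of how those three kinds of test docs could be generated and timed. Everything here is a hypothetical illustration, not code from the patch: the class name StripBench, the generator methods, and the Function<Reader, Reader> stand-in for the filter are all assumptions. The identity function in main is a placeholder; the old and new char filter constructors would be plugged in there, and since their signatures differ between Lucene versions they are left out. A naive timing loop like this is only good for a first impression, not for a final decision about the default.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical harness: names and document shapes are illustrative assumptions.
class StripBench {

    // Case 1: typical well-formed markup, every tag closed and properly nested.
    static String wellFormed(int paragraphs) {
        StringBuilder sb = new StringBuilder("<html><body>");
        for (int i = 0; i < paragraphs; i++) {
            sb.append("<p class=\"c\">paragraph ").append(i)
              .append(" with <b>bold</b> text</p>");
        }
        return sb.append("</body></html>").toString();
    }

    // Case 2: malformed markup (unterminated tag and attribute, stray '<'),
    // intended to make a stripper give up on a tag and back up.
    static String malformed(int paragraphs) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < paragraphs; i++) {
            sb.append("<p class=\"c paragraph ").append(i)
              .append(" a < b <b>bold text\n");
        }
        return sb.toString();
    }

    // Case 3: almost no HTML at all, just one stray entity at the end.
    static String mostlyPlain(int paragraphs) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < paragraphs; i++) {
            sb.append("paragraph ").append(i).append(" with plain text only\n");
        }
        return sb.append("fish &amp; chips\n").toString();
    }

    // Reads the reader to the end and returns the number of chars produced.
    static long drain(Reader r) {
        try {
            char[] buf = new char[4096];
            long total = 0;
            int n;
            while ((n = r.read(buf)) != -1) {
                total += n;
            }
            return total;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Wraps the doc with the given filter 'iterations' times and returns
    // the elapsed wall-clock time in nanoseconds.
    static long timeNanos(Function<Reader, Reader> filter, String doc, int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            drain(filter.apply(new StringReader(doc)));
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("well-formed", wellFormed(1000));
        docs.put("malformed", malformed(1000));
        docs.put("mostly-plain", mostlyPlain(1000));

        // Placeholder: replace with a function that wraps the reader in the
        // char filter implementation being measured.
        Function<Reader, Reader> filter = Function.identity();

        for (Map.Entry<String, String> e : docs.entrySet()) {
            timeNanos(filter, e.getValue(), 200);            // warm-up pass
            long nanos = timeNanos(filter, e.getValue(), 200);
            System.out.println(e.getKey() + ": " + (nanos / 1_000_000) + " ms");
        }
    }
}
```

Running the same loop once per implementation and comparing the three numbers side by side would show whether the mostly-plain case really is the differentiator.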
> JFlex-based HTMLStripCharFilter replacement
> -------------------------------------------
>
> Key: LUCENE-3690
> URL: https://issues.apache.org/jira/browse/LUCENE-3690
> Project: Lucene - Java
> Issue Type: New Feature
> Components: modules/analysis
> Affects Versions: 3.5, 4.0
> Reporter: Steven Rowe
> Assignee: Steven Rowe
> Fix For: 3.6, 4.0
>
> Attachments: LUCENE-3690.patch, LUCENE-3690.patch, LUCENE-3690.patch
>
>
> A JFlex-based HTMLStripCharFilter replacement would be more performant and
> easier to understand and maintain.