[
https://issues.apache.org/jira/browse/LUCENE-3690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185811#comment-13185811
]
Yonik Seeley commented on LUCENE-3690:
--------------------------------------
bq. The tests for the new implementation are a superset of the old
implementation's tests.
Unfortunately I'm not sure how much of a story the tests tell (and yes, that
would be my fault ;-)
My memory is rusty, but back in '05 when I coded this thing, I threw a lot of
stuff we had lying around CNET at it, and also a lot of stuff downloaded from
the web (which I couldn't just copy-n-paste into a unit test obviously). I had
a heck of a time handling all the weird stuff that could appear inside script
tags, for example, and I don't think I see much of a test for that (again... my
fault.)
bq. I welcome more examples of junk HTML to add to the tests
Not saying the new one isn't great (and matching a lot of crap from the old one
is quite an achievement).
One can be sure that the current implementation doesn't always do the right
thing, but unfortunately "right" isn't well defined here considering the domain.
The cost of keeping the current version around for a little while seems minimal.
> JFlex-based HTMLStripCharFilter replacement
> -------------------------------------------
>
> Key: LUCENE-3690
> URL: https://issues.apache.org/jira/browse/LUCENE-3690
> Project: Lucene - Java
> Issue Type: New Feature
> Components: modules/analysis
> Affects Versions: 3.5, 4.0
> Reporter: Steven Rowe
> Assignee: Steven Rowe
> Fix For: 3.6, 4.0
>
> Attachments: LUCENE-3690.patch, LUCENE-3690.patch, LUCENE-3690.patch
>
>
> A JFlex-based HTMLStripCharFilter replacement would be more performant and
> easier to understand and maintain.