[ https://issues.apache.org/jira/browse/SOLR-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Steve Rowe reopened SOLR-2328:
------------------------------
Reopening to change the resolution.
I added this test to {{HTMLStripCharFilterTest}} and it succeeded, so this bug
(?) is absolutely reproducible:
{code:java}
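public void testSOLR2328() throws Exception {
  // Input with plain double quotes around the href value; the tag has no closing '>'.
  String test = "<a href=\"http://www.twitter.com/ceonyc\"@ceonyc</a>";
  String gold = "<a href=\"http://www.twitter.com/ceonyc\"@ceonyc";
  assertHTMLStripsTo(test, gold, new HashSet<>(Arrays.asList("reserved")));

  // Variant with literal backslashes before the quotes, as typed in the report.
  String test2 = "<a href=\\\"http://www.twitter.com/ceonyc\\\"@ceonyc</a>";
  String gold2 = "<a href=\"http://www.twitter.com/ceonyc\"@ceonyc";
  // This call checks test/gold a second time; test2 and gold2 are not exercised.
  assertHTMLStripsTo(test, gold, new HashSet<>(Arrays.asList("reserved")));
}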
{code}
> HTMLStripCharFilter Leaves Broken HTML Tags
> -------------------------------------------
>
> Key: SOLR-2328
> URL: https://issues.apache.org/jira/browse/SOLR-2328
> Project: Solr
> Issue Type: Bug
> Components: Schema and Analysis
> Affects Versions: 1.4.1
> Reporter: Jeff Nadler
>
> Some kinds of 'bad' HTML are missed by HTMLStripCharFilter. For example,
> the following invalid HTML:
> <a href=\"http://www.twitter.com/ceonyc\"@ceonyc</a>
> Is filtered to:
> <a href="http://www.twitter.com/ceonyc"@ceonyc
> I understand the challenge here: without the closing > it's tough to know what to
> do. It turns out that real-world web pages are full of this kind of garbage
> HTML, and browsers (impressively!) seem to handle this quite gracefully.
> Plus, users in my app can search for 'href' and find lots of matches (that
> don't appear to contain 'href') as a result.
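To illustrate why a missing closing > leaves the opening tag in place, here is a minimal sketch. It is not how {{HTMLStripCharFilter}} is implemented; it is a toy stripper that removes only complete tags, and the class name {{NaiveTagStripper}} and its {{strip}} method are hypothetical names chosen for this example. On the input from the report, the unterminated {{<a ...}} never matches, while the well-formed {{</a>}} does, so the result has the same shape as the reported output.

```java
import java.util.regex.Pattern;

// Hypothetical toy class for illustration only; not part of Solr or Lucene.
class NaiveTagStripper {
    // A "complete" tag here is '<', a run of characters that are neither '<' nor '>', then '>'.
    private static final Pattern COMPLETE_TAG = Pattern.compile("<[^<>]*>");

    // Removes every complete tag and leaves all other text untouched.
    static String strip(String html) {
        return COMPLETE_TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // The opening tag has no closing '>', so only the trailing </a> is removed.
        String broken = "<a href=\"http://www.twitter.com/ceonyc\"@ceonyc</a>";
        System.out.println(strip(broken));

        // With the '>' present, both tags are removed and only the link text remains.
        String wellFormed = "<a href=\"http://www.twitter.com/ceonyc\">@ceonyc</a>";
        System.out.println(strip(wellFormed));
    }
}
```

Any stripper that requires a terminated tag has to choose between leaving text like this alone and guessing where the tag was meant to end, which is the trade-off the description above points at.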