[ 
https://issues.apache.org/jira/browse/SOLR-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Rowe reopened SOLR-2328:
------------------------------

Reopening to change the resolution.

I added this test to {{HTMLStripCharFilterTest}} and it succeeded, so this bug 
(?) is absolutely reproducible:

{code:java}
  public void testSOLR2328() throws Exception {
    String test = "<a href=\"http://www.twitter.com/ceonyc\"@ceonyc</a>";
    String gold =  "<a href=\"http://www.twitter.com/ceonyc\"@ceonyc";
    assertHTMLStripsTo(test, gold, new HashSet<>(Arrays.asList("reserved")));
    String test2 = "<a href=\\\"http://www.twitter.com/ceonyc\\\"@ceonyc</a>";
    String gold2 =  "<a href=\"http://www.twitter.com/ceonyc\"@ceonyc";
    assertHTMLStripsTo(test, gold, new HashSet<>(Arrays.asList("reserved")));
  }
{code}

> HTMLStripCharFilter Leaves Broken HTML Tags
> -------------------------------------------
>
>                 Key: SOLR-2328
>                 URL: https://issues.apache.org/jira/browse/SOLR-2328
>             Project: Solr
>          Issue Type: Bug
>          Components: Schema and Analysis
>    Affects Versions: 1.4.1
>            Reporter: Jeff Nadler
>
> Some kinds of 'bad' HTML are missed by HTMLStripCharFilter.   For example, 
> the following invalid HTML:
>      <a href=\"http://www.twitter.com/ceonyc\"@ceonyc</a>
> is filtered to:
>      <a href="http://www.twitter.com/ceonyc"@ceonyc
> I understand the challenge here: without the closing > it's tough to know
> what to do.  It turns out that real-world web pages are full of this kind of
> garbage HTML, and browsers (impressively!) seem to handle it quite
> gracefully.  As a result, users in my app can search for 'href' and find lots
> of matches that don't appear to contain 'href'.
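
To illustrate why the missing closing > is ambiguous, here is a minimal standalone sketch. It is not how HTMLStripCharFilter is implemented; the class name {{NaiveTagStripDemo}} and both regex-based strippers are hypothetical and exist only for this illustration. A "greedy" stripper treats everything from < up to the next > as one tag and swallows the visible text, while a "strict" stripper refuses to treat the unterminated <a ...> as a tag and leaves it in the output, which is the behavior reported in this issue.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Lucene/Solr and not the filter's algorithm.
final class NaiveTagStripDemo {
    // Greedy: a tag runs from '<' to the next '>', whatever lies in between.
    private static final Pattern GREEDY = Pattern.compile("<[^>]*>");
    // Strict: a tag may not contain another '<' before its closing '>'.
    private static final Pattern STRICT = Pattern.compile("<[^<>]*>");

    static String stripGreedy(String html) {
        return GREEDY.matcher(html).replaceAll("");
    }

    static String stripStrict(String html) {
        return STRICT.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // The malformed input from the issue: the opening <a ...> tag has no closing '>'.
        String input = "<a href=\"http://www.twitter.com/ceonyc\"@ceonyc</a>";
        // Greedy removes everything, including the visible text "@ceonyc".
        System.out.println("greedy: [" + stripGreedy(input) + "]");
        // Strict removes only "</a>" and leaves the broken opening tag behind.
        System.out.println("strict: [" + stripStrict(input) + "]");
    }
}
```

Neither result is what a browser would render ("@ceonyc" as link text), which suggests that handling this input well needs some recovery heuristic rather than plain tag matching.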



