[ 
https://issues.apache.org/jira/browse/SOLR-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16067483#comment-16067483
 ] 

Steve Rowe edited comment on SOLR-2328 at 6/29/17 12:05 AM:
------------------------------------------------------------

Reopening to change the resolution.

I added this test to {{HTMLStripCharFilterTest}} and it succeeded, so this bug 
(?) is absolutely reproducible:

*edit*: changed to remove the reserved tag handling:

{code:java}
  public void testSOLR2328() throws Exception {
    String test = "<a href=\"http://www.twitter.com/ceonyc\"@ceonyc</a>";
    String gold = "<a href=\"http://www.twitter.com/ceonyc\"@ceonyc";
    assertHTMLStripsTo(test, gold, Collections.emptySet());
    String test2 = "<a href=\\\"http://www.twitter.com/ceonyc\\\"@ceonyc</a>";
    String gold2 = "<a href=\"http://www.twitter.com/ceonyc\"@ceonyc";
    assertHTMLStripsTo(test, gold, Collections.emptySet());
  }
{code}


was (Author: steve_rowe):
Reopening to change the resolution.

I added this test to {{HTMLStripCharFilterTest}} and it succeeded, so this bug 
(?) is absolutely reproducible:

{code:java}
  public void testSOLR2328() throws Exception {
    String test = "<a href=\"http://www.twitter.com/ceonyc\"@ceonyc</a>";
    String gold = "<a href=\"http://www.twitter.com/ceonyc\"@ceonyc";
    assertHTMLStripsTo(test, gold, new HashSet<>(Arrays.asList("reserved")));
    String test2 = "<a href=\\\"http://www.twitter.com/ceonyc\\\"@ceonyc</a>";
    String gold2 = "<a href=\"http://www.twitter.com/ceonyc\"@ceonyc";
    assertHTMLStripsTo(test, gold, new HashSet<>(Arrays.asList("reserved")));
  }
{code}

> HTMLStripCharFilter Leaves Broken HTML Tags
> -------------------------------------------
>
>                 Key: SOLR-2328
>                 URL: https://issues.apache.org/jira/browse/SOLR-2328
>             Project: Solr
>          Issue Type: Bug
>          Components: Schema and Analysis
>    Affects Versions: 1.4.1
>            Reporter: Jeff Nadler
>
> Some kinds of 'bad' HTML are missed by HTMLStripCharFilter. For example,
> the following invalid HTML:
>      <a href=\"http://www.twitter.com/ceonyc\"@ceonyc</a>
> is filtered to:
>      <a href="http://www.twitter.com/ceonyc"@ceonyc
> I understand the challenge here: without the closing > it's tough to know
> what to do. It turns out that real-world web pages are full of this kind of
> garbage HTML, and browsers (impressively!) seem to handle it quite
> gracefully. Plus, as a result, users in my app can search for 'href' and
> find lots of matches that don't appear to contain 'href'.
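
The behavior in the description can be illustrated with a small self-contained sketch. This is only an assumption-laden illustration, not how HTMLStripCharFilter is implemented: the class {{NaiveTagStripDemo}} and the method {{stripCompleteTags}} are hypothetical names, and the regular expression is a stand-in for any stripper that removes only complete {{<...>}} tags. Because the opening tag is never closed, only the trailing {{</a>}} qualifies for removal, so the {{href}} text stays in the output.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; NOT the HTMLStripCharFilter implementation.
class NaiveTagStripDemo {
  // Matches only complete tags: '<', any run of characters other than
  // '<' or '>', then a closing '>'.
  private static final Pattern COMPLETE_TAG = Pattern.compile("<[^<>]*>");

  // Removes every complete tag and leaves all other text untouched.
  static String stripCompleteTags(String html) {
    return COMPLETE_TAG.matcher(html).replaceAll("");
  }

  public static void main(String[] args) {
    // The opening tag has no closing '>', so it is not recognized as a tag.
    String broken = "<a href=\"http://www.twitter.com/ceonyc\"@ceonyc</a>";
    System.out.println(stripCompleteTags(broken));

    // With the '>' present, both tags are removed and only the text remains.
    String wellFormed = "<a href=\"http://www.twitter.com/ceonyc\">@ceonyc</a>";
    System.out.println(stripCompleteTags(wellFormed));
  }
}
```

A more lenient stripper would need a recovery rule for an unterminated tag, for example treating the next {{<}} as the end of the broken tag, which is roughly the kind of tolerance the reporter attributes to browsers.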



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
