[ 
https://issues.apache.org/jira/browse/SOLR-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16067503#comment-16067503
 ] 

Jan Høydahl commented on SOLR-2328:
-----------------------------------

Resolving this issue was not meant to imply that Solr actually handles the 
given example, but rather to question whether the filter should be modified 
to handle all kinds of invalid HTML. There are countless ways people can 
mistype HTML markup, and I think it would be wrong to try to guess what was 
intended; doing so would probably introduce other bugs.

So I closed this mainly to ask whether this exact example is common enough, 
and causes enough pain in real search indexes, to warrant a special fix.

My take is that HTMLStripCharFilter should strip real HTML, and if anyone 
wants a more lenient filter, the right cure is to create a new one, e.g. 
{{HTMLStripLenientCharFilter}}, built on https://jsoup.org/ or a similar 
existing library.
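
To make the "guessing" concrete, here is a minimal sketch of one possible 
lenient rule, written against plain Java strings rather than Lucene's 
CharFilter API or jsoup. The class {{LenientHtmlStrip}} and its methods are 
hypothetical names for illustration only, and the heuristic (a quoted 
attribute value followed directly by a character other than whitespace, '/' 
or '>' is assumed to mark a missing '>') is an assumption, not something 
Solr does. It happens to cover the example in this issue, but other kinds of 
broken markup would need other guesses.

```java
class LenientHtmlStrip {

    /** Hypothetical helper: removes tags, tolerating a missing '>' after a quoted attribute value. */
    static String stripLenient(String html) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        int n = html.length();
        while (i < n) {
            char c = html.charAt(i);
            // Only treat '<' as markup when it is followed by a plausible tag start.
            if (c == '<' && i + 1 < n && isTagStart(html.charAt(i + 1))) {
                i = skipTag(html, i + 1);
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    static boolean isTagStart(char c) {
        return Character.isLetter(c) || c == '/' || c == '!';
    }

    /** Returns the index of the first character after the tag whose body starts at 'from'. */
    static int skipTag(String s, int from) {
        int n = s.length();
        int i = from;
        while (i < n) {
            char c = s.charAt(i);
            if (c == '>') {
                return i + 1; // normal end of tag
            }
            if (c == '"' || c == '\'') {
                int close = s.indexOf(c, i + 1);
                if (close < 0) {
                    return n; // unterminated quote: drop the rest of the input
                }
                i = close + 1;
                if (i < n) {
                    char next = s.charAt(i);
                    if (next != '>' && next != '/' && !Character.isWhitespace(next)) {
                        return i; // assumption: the '>' was omitted right here
                    }
                }
                continue;
            }
            i++;
        }
        return n; // tag never closed: drop it
    }

    public static void main(String[] args) {
        // The broken markup from this issue; prints "@ceonyc"
        System.out.println(stripLenient("<a href=\"http://www.twitter.com/ceonyc\"@ceonyc</a>"));
    }
}
```

Even this small rule makes trade-offs (for instance, it discards everything 
after an unterminated quote), which is why such behaviour seems better kept 
in a separate, opt-in filter than in HTMLStripCharFilter itself.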

> HTMLStripCharFilter Leaves Broken HTML Tags
> -------------------------------------------
>
>                 Key: SOLR-2328
>                 URL: https://issues.apache.org/jira/browse/SOLR-2328
>             Project: Solr
>          Issue Type: Bug
>          Components: Schema and Analysis
>    Affects Versions: 1.4.1
>            Reporter: Jeff Nadler
>
> Some kinds of 'bad' HTML are missed by HTMLStripCharFilter.   For example, 
> the following invalid HTML:
>      <a href=\"http://www.twitter.com/ceonyc\"@ceonyc</a>
> Is filtered to:
>      <a href="http://www.twitter.com/ceonyc"@ceonyc
> I understand the challenge here, without the end > it's tough to know what to 
> do.  It turns out that real-world web pages are full of this kind of garbage 
> HTML, and browsers (impressively!) seem to handle this quite gracefully.   
> Plus, users in my app can search for 'href' and find lots of matches (that 
> don't appear to contain 'href') as a result.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
