[ https://issues.apache.org/jira/browse/SOLR-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jan Høydahl closed SOLR-2328.
-----------------------------
Resolution: Cannot Reproduce
Closing ancient issue as "cannot reproduce".
If anyone can illustrate that this is a real problem with real HTML content out
there, then please re-open this issue and include steps to reproduce and
suggestions for how to fix.
> HTMLStripCharFilter Leaves Broken HTML Tags
> -------------------------------------------
>
> Key: SOLR-2328
> URL: https://issues.apache.org/jira/browse/SOLR-2328
> Project: Solr
> Issue Type: Bug
> Components: Schema and Analysis
> Affects Versions: 1.4.1
> Reporter: Jeff Nadler
>
> Some kinds of 'bad' HTML are missed by HTMLStripCharFilter. For example,
> the following invalid HTML:
> <a href=\"http://www.twitter.com/ceonyc\"@ceonyc</a>
> Is filtered to:
> <a href="http://www.twitter.com/ceonyc"@ceonyc
> I understand the challenge here: without the closing > it's tough to know
> what to do. It turns out that real-world web pages are full of this kind of
> garbage HTML, and browsers (impressively!) seem to handle it quite gracefully.
> Plus, users in my app can search for 'href' and find lots of matches (that
> don't appear to contain 'href') as a result.
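
The behaviour described above can be illustrated with a small standalone
sketch. It is not HTMLStripCharFilter's code and makes no claim about how
Solr parses HTML. It assumes the backslashes in the reported input are
escaping artifacts, and the names strip_strict and strip_lenient are
hypothetical. The strict version removes only tags that have a closing >,
which happens to leave the same residue as in the report. The lenient
version adds one possible heuristic: an opening tag that carries at least
one quoted attribute but no closing > is also dropped.

```python
import re

# Assumption: the reported input, with the backslash escapes removed.
SAMPLE = '<a href="http://www.twitter.com/ceonyc"@ceonyc</a>'

# A tag with a closing '>' and no nested '<' inside it.
WELL_FORMED = r"</?[A-Za-z][^<>]*>"

# An opening tag with one or more quoted attributes but no closing '>'.
# Requiring a quoted attribute keeps plain text such as "a <b" untouched.
UNTERMINATED = (
    r"<[A-Za-z][A-Za-z0-9]*"
    r"(?:\s+[A-Za-z_:][-\w:.]*\s*=\s*(?:\"[^\"]*\"|'[^']*'))+"
)

STRICT_RE = re.compile(WELL_FORMED)
# WELL_FORMED comes first so that valid tags are matched in full.
LENIENT_RE = re.compile(WELL_FORMED + "|" + UNTERMINATED)


def strip_strict(text):
    """Remove only tags that are properly closed with '>'."""
    return STRICT_RE.sub("", text)


def strip_lenient(text):
    """Also remove opening tags that are missing their closing '>'."""
    return LENIENT_RE.sub("", text)


if __name__ == "__main__":
    print(strip_strict(SAMPLE))   # <a href="http://www.twitter.com/ceonyc"@ceonyc
    print(strip_lenient(SAMPLE))  # @ceonyc
```

With the lenient pattern the word 'href' no longer reaches the index for
this input. The heuristic is deliberately conservative, so text such as
"1 < 2" is left alone, and it will still miss other kinds of broken markup.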