[
https://issues.apache.org/jira/browse/SOLR-1394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12766727#action_12766727
]
Yonik Seeley commented on SOLR-1394:
------------------------------------
I've been testing this with a bunch of different HTML. I don't see any
places where this is worse, and it stops tokens from being split when they
shouldn't be.
Given that the splitting is clearly a bug, and that changes to this filter
won't affect the rest of Solr, I plan on committing this shortly.
Offsets and highlighting still aren't perfect, but this patch makes them
no worse.
I modified the solr.xml document to escape the '&' and then added the strip
char filter to the text field.
The query was héllo OR hello OR unicode
Before this patch: Good <em>unicode</em> support: héllo <em>(hell</em>o
with an accent over the e)
After this patch: Good <em>unicode</em> support: <em>héll</em>o
<em>(hell</em>o with an accent over the e)
> HTML stripper is splitting tokens
> ---------------------------------
>
> Key: SOLR-1394
> URL: https://issues.apache.org/jira/browse/SOLR-1394
> Project: Solr
> Issue Type: Bug
> Components: Analysis
> Affects Versions: 1.4
> Reporter: Anders Melchiorsen
> Attachments: SOLR-1394.patch, SOLR-1394.patch
>
>
> The Solr HTML stripper is replacing any removed HTML with whitespace. This is
> to keep offsets correct for highlighting.
> However, as was already pointed out in SOLR-42, this means that any token
> containing an HTML entity will be split into several tokens. That makes the
> HTML stripper completely unreliable for international text (and any text is
> potentially international).
> The current code is actually deficient for BOTH highlighting and indexing,
> where the previous incarnation (that did not insert spaces) only had problems
> with highlighting.
> The only workaround is to not use entities at all, which is impossible in
> some situations and inconvenient in most situations. If the client is
> required to transform entities before handing text to Solr, it might as well be
> required to also strip tags, and then the HTML stripper would not be needed
> at all.
> Today, we have a better solution that can be used: offset correction. We can
> then avoid inserting extra whitespace, but still get correct offsets. The
> attached patch implements just that.
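
To make the difference between the two approaches concrete, here is a minimal
sketch in Python. It is not the code from the attached patch or from Solr. The
function names (strip_with_padding, strip_with_offsets, original_span) and the
simple regex are assumptions made for illustration only. The first function
imitates the padding behaviour described in the issue, and the second strips
markup without padding while recording where each output character came from,
so offsets can be corrected afterwards.

```python
import html
import re

# Hypothetical, simplified pattern: an HTML tag, or an entity like &eacute; or &#233;
_MARKUP = re.compile(r"<[^>]+>|&#?\w+;")


def strip_with_padding(text):
    """Imitate the padding approach: output keeps the same length as the input."""
    def pad(match):
        token = match.group(0)
        if token.startswith("&"):
            # Decode the entity, then fill the leftover width with spaces.
            decoded = html.unescape(token)
            return decoded + " " * (len(token) - len(decoded))
        # Tags are replaced entirely by spaces.
        return " " * len(token)
    return _MARKUP.sub(pad, text)


def strip_with_offsets(text):
    """Strip markup without padding.

    Returns (clean_text, offsets), where offsets[i] is the index in the
    original text that clean_text[i] came from.
    """
    out, offsets = [], []
    pos = 0
    for match in _MARKUP.finditer(text):
        # Copy the plain text that precedes this piece of markup.
        for i in range(pos, match.start()):
            out.append(text[i])
            offsets.append(i)
        token = match.group(0)
        if token.startswith("&"):
            # A decoded entity maps back to the start of the entity.
            for ch in html.unescape(token):
                out.append(ch)
                offsets.append(match.start())
        pos = match.end()
    # Copy whatever follows the last piece of markup.
    for i in range(pos, len(text)):
        out.append(text[i])
        offsets.append(i)
    return "".join(out), offsets


def original_span(offsets, start, end, original_length):
    """Map a [start, end) span in the clean text back to the original text."""
    orig_end = offsets[end] if end < len(offsets) else original_length
    return offsets[start], orig_end


if __name__ == "__main__":
    raw = "h&eacute;llo world"
    print(strip_with_padding(raw).split())   # ['hé', 'llo', 'world']
    clean, offsets = strip_with_offsets(raw)
    print(clean.split())                     # ['héllo', 'world']
    start, end = original_span(offsets, 0, len("héllo"), len(raw))
    print(raw[start:end])                    # h&eacute;llo
```

With padding, the whitespace tokenizer sees "hé" and "llo" as separate tokens.
With the offset map, the token stays whole and its span can still be traced
back to the original markup for highlighting.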