Highlighter does not work with HTML content that's passed through
HTMLStrip*Tokenizer
-------------------------------------------------------------------------------------
Key: SOLR-57
URL: http://issues.apache.org/jira/browse/SOLR-57
Project: Solr
Issue Type: Bug
Components: search
Environment: Red Hat Linux 9, Tomcat 5.5.20
Reporter: Ho Yin Au
Priority: Minor
I have a fieldtype with the following definition:
<fieldtype name="htmltext" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.HTMLStripStandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
  </analyzer>
</fieldtype>
When fields of this type are included in the list of fields to be
highlighted, the highlighted term is always misplaced: the highlighter does
not account for the HTML tags that precede the term, so the highlighted
snippet ends up looking like this:
Does your comptuer meet the <a
href="http:/<em>/www.example</em>.com/system_requirements.shtml">minimum system
requirements</a>?
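
The behaviour described above can be reproduced outside Solr with a small
sketch. This is not Solr's or Lucene's code; it only illustrates the
suspected mechanism, assuming that token offsets are computed against the
tag-stripped text and then applied to the original stored text. The regex
stripper, the sample string, and every function name here are hypothetical.

```python
import re

# Simplified tag matcher; real HTML stripping handles far more cases.
TAG = re.compile(r"<[^>]*>")


def strip_tags(html):
    """Remove tags outright, so any offsets found later refer to the stripped text."""
    return TAG.sub("", html)


def strip_with_offsets(html):
    """Remove tags, but remember where each kept character sat in the original."""
    kept = []     # characters that survive stripping
    offsets = []  # offsets[i] is the index in html of kept[i]
    pos = 0
    for match in TAG.finditer(html):
        for i in range(pos, match.start()):
            kept.append(html[i])
            offsets.append(i)
        pos = match.end()
    for i in range(pos, len(html)):
        kept.append(html[i])
        offsets.append(i)
    return "".join(kept), offsets


def highlight(text, start, end):
    """Wrap text[start:end] in <em> tags."""
    return text[:start] + "<em>" + text[start:end] + "</em>" + text[end:]


# Hypothetical stored field value.
original = ('Does your computer meet the '
            '<a href="http://www.example.com/">minimum</a> requirements?')
term = "minimum"

# Offsets of the term in the stripped text.
stripped = strip_tags(original)
start = stripped.index(term)
end = start + len(term)

# Applying stripped-text offsets to the original text lands inside the tag.
wrong = highlight(original, start, end)

# Mapping the offsets back to the original text first puts the markup on the term.
stripped_again, offsets = strip_with_offsets(original)
right = highlight(original, offsets[start], offsets[end - 1] + 1)

print(wrong)
print(right)
```

In this sketch the first result places the <em> markup inside the anchor
tag, much like the snippet above, while the second wraps the matched word.
Whether the tokenizer should report corrected offsets or the highlighter
should work on stripped text is a separate design question that the sketch
does not settle.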