Add hl.df (highlight-specific default field) param, so highlighting can have a
separate analysis path
-----------------------------------------------------------------------------------------------------
Key: SOLR-1910
URL: https://issues.apache.org/jira/browse/SOLR-1910
Project: Solr
Issue Type: Improvement
Components: highlighter
Affects Versions: 1.4
Reporter: Chris Harris
Attachments: SOLR-1910.patch
Summary: Patch adds a hl.df parameter, to help with (some) situations where the
highlighter currently uses the "wrong" analyzer for highlighting.
What: hl.df is like the normal df parameter, except that it takes effect only
during highlighting. (In fact the implementation is basically to temporarily
mess with the normal df parameter at the start of highlighting, and then
revert to the original value when highlighting is complete.) When hl.df is
specified, we make sure not to use the Query object that was parsed by
QueryComponent, but rather make our own. In the right circumstances anyway,
this means that a more appropriate analyzer gets used for highlighting.
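A minimal sketch of the save/override/restore idea described above. It is not the code in the attached patch. It uses a plain Map as a stand-in for the request parameters rather than Solr's real parameter classes, and the names HlDfSketch and withHighlightDefaultField are hypothetical.

```java
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for the request handling; not Solr's actual API.
final class HlDfSketch {

    /**
     * Runs highlightStep with "df" temporarily replaced by the value of
     * "hl.df" (if present), then restores the original "df", even if the
     * highlighting step throws.
     */
    static <T> T withHighlightDefaultField(Map<String, String> params,
                                           Function<Map<String, String>, T> highlightStep) {
        String hlDf = params.get("hl.df");
        if (hlDf == null) {
            // No override requested: highlight with the parameters as they are.
            return highlightStep.apply(params);
        }
        boolean hadDf = params.containsKey("df");
        String originalDf = params.get("df");
        params.put("df", hlDf);               // override for the duration of highlighting
        try {
            // The step is expected to re-parse the query here, so that the
            // overridden default field (and its analyzer) is actually used.
            return highlightStep.apply(params);
        } finally {
            if (hadDf) {
                params.put("df", originalDf); // revert to the original value
            } else {
                params.remove("df");          // there was no df to begin with
            }
        }
    }
}
```

With df=body and hl.df=highlight, the step sees df=highlight while it runs, and the map holds df=body again afterwards. The try/finally is a design choice of this sketch, so that a failure during highlighting cannot leave the override in place.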
Motivation: Currently, in a normal query+highlighting request, the highlighter
re-uses the Query object parsed by the QueryComponent. This can result in
incorrect highlights if the field being highlighted is of a different type than
the field being queried. In my particular case:
* My queries don't explicitly specify field names; they always rely on the
default field
* My default field for search is "body"
* body is a unigram-plus-bigram field. So, e.g. input "audit trail" gets
turned into tokens "audit / audit trail / trail". (This is a performance
optimization.)
* If I try to highlight directly on "body", the highlights get screwed up.
(This is because the highlighter doesn't really support the kind of
"continuously overlapping" tokens generated by my analysis chain. In short, the
bigrams confuse the TokenGroup class.)
* To avoid these highlighting problems, I don't directly highlight "body", but
rather a "highlight" field, which has no bigram tokens. ("highlight" is
populated from "body" with a copyField directive.)
* Without hl.df, I have a new class of highlighting problems. In particular,
if the user enters a phrase search (e.g. "audit trail"), then that phrase
appears unhighlighted in the highlighter output. The short version of why is
that the analyzer used to parse the query produced a Query object that contains
bigrams, but the text that we're highlighting doesn't contain bigrams.
* With hl.df, the analyzers match up for highlighting; the Query object used
for highlighting does _not_ contain bigrams, just like the "highlight" field.
(I realize it may help to expand the description of this use case, but I'm a
bit hurried right now.)
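The token mismatch in the list above can be illustrated with a toy tokenizer. This is only a sketch: it does not use Lucene's analyzers or model how the query parser builds a phrase query, and the class and method names are hypothetical. It only shows that a bigram term produced on the query side has nothing to match in a unigram-only token stream.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration only; real analysis chains (e.g. shingle filters) differ in detail.
final class BigramMismatchSketch {

    /** Splits text on whitespace into lowercase unigrams (like the "highlight" field). */
    static List<String> unigrams(String text) {
        List<String> out = new ArrayList<>();
        for (String w : text.trim().toLowerCase().split("\\s+")) {
            if (!w.isEmpty()) {
                out.add(w);
            }
        }
        return out;
    }

    /** Interleaves unigrams with bigrams (like the "body" field). */
    static List<String> unigramsPlusBigrams(String text) {
        List<String> words = unigrams(text);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            out.add(words.get(i));
            if (i + 1 < words.size()) {
                // Bigram made of this word and the next one.
                out.add(words.get(i) + " " + words.get(i + 1));
            }
        }
        return out;
    }
}
```

Here unigramsPlusBigrams("audit trail") yields the three tokens audit, audit trail, trail, while unigrams("the audit trail") contains no "audit trail" token at all, so a query term built from the bigram cannot line up with the text being highlighted.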
I wanted to throw this out there, partly in case people have any better
solutions. One variation on the hl.df option that might be worth considering
is hl.UseHighlightedFieldAsDefaultField, which would create a new Query object
not just once at the start of highlighting, but separately for each particular
field that's getting highlighted.
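A rough sketch of what the per-field variation could look like, assuming a query parser that can be handed a default field. Nothing here is from the patch: PerFieldQuerySketch, parsePerField and the parseWithDefaultField callback are hypothetical names, and the callback stands in for whatever the real parser would be.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

// Hypothetical sketch; the parser is passed in so no real Solr API is assumed.
final class PerFieldQuerySketch {

    /**
     * Parses the same query text once per highlighted field, each time using
     * that field as the default field, and returns the results keyed by field.
     */
    static <Q> Map<String, Q> parsePerField(String queryText,
                                            List<String> highlightFields,
                                            BiFunction<String, String, Q> parseWithDefaultField) {
        Map<String, Q> byField = new LinkedHashMap<>(); // keeps the hl.fl order
        for (String field : highlightFields) {
            byField.put(field, parseWithDefaultField.apply(queryText, field));
        }
        return byField;
    }
}
```

The trade-off compared with hl.df is one extra parse per highlighted field, in exchange for each field being highlighted with a Query built by its own analyzer.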