Good morning,
We currently use Lucene 4.3 in our project. We automatically generate
PrefixQueries and we are passing the rewritten query to the Highlighter
to highlight search terms in the search result.
Up until a few days ago, we were using a
MultiTermQuery.CONSTANT_SCORE_BOOLEAN_QUERY_REWRITE because the
highlighter does not work with the ConstantScoreQueries generated by the
MultiTermQuery.ConstantScoreAutoRewrite. We have also set the
"maxClauseCount" to a very large number to avoid the
BooleanQuery.TooManyClauses exception. This worked well for years.
Recently, however, searches such as "a b c" or "s t am p s" have
caused OutOfMemoryErrors, so we now use the ConstantScoreAutoRewrite
and accept that some terms are not highlighted in the search result.
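
To illustrate why short prefixes are so expensive under a boolean rewrite, here is a toy model in plain Java. It is only a sketch and does not use any Lucene classes: the class name, the method names and the sample terms are all made up for illustration. The assumption it models is that a boolean rewrite of a prefix query produces one clause per index term that starts with the prefix, so a one-letter prefix on a large index yields a very large BooleanQuery.

```java
import java.util.Arrays;
import java.util.SortedSet;
import java.util.TreeSet;

// Toy model only: no Lucene classes are involved, and all names are hypothetical.
class PrefixExpansionSketch {

    // Counts the dictionary terms that start with the given prefix.
    // A boolean rewrite of a prefix query needs one clause for each of them.
    public static int clausesFor(SortedSet<String> termDictionary, String prefix) {
        int count = 0;
        // In a sorted set, all terms sharing a prefix form one contiguous run
        // that begins at the first term >= prefix.
        for (String term : termDictionary.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // left the run of matching terms
            }
            count++;
        }
        return count;
    }

    // Sums the clause counts over all whitespace-separated prefixes in the input.
    public static int totalClauses(SortedSet<String> termDictionary, String userInput) {
        int total = 0;
        for (String prefix : userInput.trim().split("\\s+")) {
            if (!prefix.isEmpty()) {
                total += clausesFor(termDictionary, prefix);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Invented sample terms; a real index would hold far more of them.
        SortedSet<String> terms = new TreeSet<>(Arrays.asList(
                "am", "apple", "apricot", "banana", "bar", "cat", "p", "stamp", "stamps"));
        for (String input : new String[] {"a b c", "s t am p s", "stamp"}) {
            System.out.println(input + " -> " + totalClauses(terms, input) + " clauses");
        }
    }
}
```

With only a handful of sample terms the counts stay tiny, but the count for a one-letter prefix grows with the number of index terms beginning with that letter, which is where the memory goes.
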
However, I read in the changelog of Lucene 5.0 that
MultiTermQuery.ConstantScoreAutoRewrite was removed in favour of
MultiTermQuery.CONSTANT_SCORE_FILTER_REWRITE.
My problems:
1) PrefixQueries rewritten with a
MultiTermQuery.CONSTANT_SCORE_FILTER_REWRITE don't work with the default
Highlighter at all.
2) Passing the original query to the Highlighter directly worked in my
test cases, but those did not use a very large dataset. I have noticed that the
WeightedSpanTermExtractor, which is used by the Highlighter, uses a
MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE, so I fear that if we do that, we
will get OutOfMemoryErrors again when somebody searches for "a b c".
What method do you suggest for highlighting prefix terms? I should also
mention that we are using a custom formatter and a custom
text fragmenter. I have not found any tutorials for the
FastVectorHighlighter. The PostingsHighlighter might work, but I'm not
sure how to implement custom fragment sizes with it.
Thanks in advance,
Nils Knappmeier
--
Nils Knappmeier | Software Engineer
intelligent views gmbh
Julius-Reiber-Str. 17 |64293 Darmstadt
Tel ++49(0)6151 - 5006-228 | Fax ++49(0)6151 - 5006-138
e-mail: [email protected] | www.i-views.de
Managing directors: Jörg Kleinz, Klaus Reichenberger
The company is registered at the Amtsgericht Darmstadt (registered office of
the company), no. HRB 7965