Doug Cutting wrote:
> Shouldn't the search code already take care of that?
No. The search may return documents that happen to contain both "Doug Cutting" and "google". The current highlighter implementation uses all query terms (ignoring any AND/OR operators) and looks for matches. Ideally, "Doug Cutting" shouldn't be highlighted in the document "Doug Cutting loves google" when I searched for ("Doug Cutting" AND lucene) OR google.
This is a nice-to-have and I suspect this is not an issue people feel strongly about. We could continue to ignore the complexities of representing the results of such boolean logic - most queries don't use it anyway.
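To make the problem above concrete, here is a rough Python sketch of a highlighter that flattens the query into a bag of terms. It is only an illustration of the behaviour described, not the actual highlighter code; the names flatten_terms and naive_highlight and the <b> markup are made up for this example.

```python
import re

def flatten_terms(query):
    """Collect every phrase and word in the query, discarding boolean operators."""
    phrases = re.findall(r'"([^"]+)"', query)
    rest = re.sub(r'"[^"]+"', " ", query)
    words = [w for w in re.findall(r"\w+", rest) if w not in ("AND", "OR", "NOT")]
    return [p.lower() for p in phrases] + [w.lower() for w in words]

def naive_highlight(text, query):
    """Wrap every occurrence of any query term in <b> tags (hypothetical markup)."""
    out = text
    for term in flatten_terms(query):
        out = re.sub(re.escape(term),
                     lambda m: "<b>" + m.group(0) + "</b>",
                     out, flags=re.IGNORECASE)
    return out

query = '("Doug Cutting" AND lucene) OR google'
print(naive_highlight("Doug Cutting loves google", query))
# prints: <b>Doug Cutting</b> loves <b>google</b>
```

"Doug Cutting" gets highlighted even though the AND clause it belongs to was never satisfied; the document only matched because of "google".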
> The query should thus be compared to each potential highlight fragment. This evaluation is different than the whole-document evaluation performed by search. If no fragments match the entire query, then fragments should be selected which, considered together, match the entire query.
Is this based on the approach (which I think you suggested previously) of chopping the doc into fragment-sized docs held in a RAM directory and then querying it to get the best fragments? I think it would prove difficult to identify the combination of fragments that ultimately satisfied a query containing complex boolean logic.
My original idea for an approach was to let the queries initially generate a "heat map" which scored every token in the document. Any boolean queries which failed to be satisfied completely (e.g. the "Doug Cutting" AND lucene example) would not generate a score for their tokens. Phrase queries would only score the token occurrences in the document where all tokens were grouped.
The highlighter would then use the heat map to pick the best "runs" of tokens.
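A minimal Python sketch of the heat map idea, under some assumptions of mine: the query is a nested tuple tree (("term", t), ("phrase", [...]), ("and", [...]), ("or", [...])), every hit scores a flat 1.0, and a "run" is simply the fixed-width window with the highest total. None of this is Lucene API; heat_map and best_run are hypothetical names.

```python
def heat_map(query, tokens):
    """Return one score per token; an unsatisfied AND contributes nothing."""
    kind, arg = query
    n = len(tokens)
    if kind == "term":
        return [1.0 if t == arg else 0.0 for t in tokens]
    if kind == "phrase":
        # Score only positions where all phrase tokens occur together, in order.
        scores = [0.0] * n
        k = len(arg)
        for i in range(n - k + 1):
            if tokens[i:i + k] == arg:
                for j in range(i, i + k):
                    scores[j] = 1.0
        return scores
    if kind not in ("and", "or"):
        raise ValueError("unknown query node: %r" % (kind,))
    children = [heat_map(q, tokens) for q in arg]
    if kind == "and" and not all(any(c) for c in children):
        # At least one required clause had no hits: no score for any of its tokens.
        return [0.0] * n
    return [sum(col) for col in zip(*children)]

def best_run(scores, width):
    """Return (start, end) of the hottest window of at most `width` tokens."""
    best_start, best_total = 0, -1.0
    for start in range(max(1, len(scores) - width + 1)):
        total = sum(scores[start:start + width])
        if total > best_total:
            best_start, best_total = start, total
    return best_start, min(len(scores), best_start + width)

query = ("or", [("and", [("phrase", ["doug", "cutting"]), ("term", "lucene")]),
                ("term", "google")])
tokens = "doug cutting loves google".split()
scores = heat_map(query, tokens)
print(scores)               # [0.0, 0.0, 0.0, 1.0]
print(best_run(scores, 2))  # (2, 4)
```

Here "doug cutting" stays cold because "lucene" is missing, so only "google" is a candidate for highlighting. A real implementation would presumably weight scores (e.g. by term rarity) rather than use a flat 1.0.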