[
https://issues.apache.org/jira/browse/LUCENE-8286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578802#comment-16578802
]
David Smiley commented on LUCENE-8286:
--------------------------------------
Made substantial progress on the PR:
{noformat}
LUCENE-8286 UH: Use MI.getSubMatches(). Removed PhraseHelper changes; not
necessary anymore.
Updated based on MI improvements in master.
With subMatches, we have better fidelity on span queries.
And since MI can handle span queries now, no need to touch PhraseHelper.
* added to UHComponents: query, and highlightFlags
* updated tests to handle with/without WEIGHT_MATCHES
* TestUnifiedHighlighterStrictPhrases uses more randomization.
Removed brittle score calculation dependence.
* Test Passage matches data is in order
TODO: OE freq & term()
{noformat}
It was nice to see that UH's PhraseHelper can be circumvented now. Handling
mi.getSubMatches proved to be difficult, but I ultimately got it working. See
https://github.com/dsmiley/lucene-solr/blob/LUCENE-8286/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/OffsetsEnum.java#L168
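To illustrate the general shape of the sub-match handling (not the actual code at the link above), here is a minimal sketch. It assumes a hypothetical Match class standing in for one position of a MatchesIterator: a compound match such as a span query carries nested sub-matches, a plain term match carries none, and the highlighter wants the offsets of the leaves only. The class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SubMatchSketch {

  /** Hypothetical stand-in for one match; NOT Lucene's MatchesIterator. */
  public static final class Match {
    final int startOffset;
    final int endOffset;
    final List<Match> subMatches; // empty for a leaf (single-term) match

    public Match(int startOffset, int endOffset, Match... subMatches) {
      this.startOffset = startOffset;
      this.endOffset = endOffset;
      this.subMatches = Arrays.asList(subMatches);
    }
  }

  /** Collects {start, end} offset pairs of the leaf matches, depth-first. */
  public static List<int[]> leafOffsets(List<Match> matches) {
    List<int[]> out = new ArrayList<>();
    for (Match m : matches) {
      collect(m, out);
    }
    return out;
  }

  private static void collect(Match m, List<int[]> out) {
    if (m.subMatches.isEmpty()) {
      // Leaf: highlight this match as-is.
      out.add(new int[] {m.startOffset, m.endOffset});
      return;
    }
    // Compound match: highlight the individual terms, not the whole span.
    for (Match sub : m.subMatches) {
      collect(sub, out);
    }
  }

  public static void main(String[] args) {
    // A span match covering offsets 4-19 with two term sub-matches.
    Match span = new Match(4, 19, new Match(4, 9), new Match(16, 19));
    Match single = new Match(30, 34);
    for (int[] o : leafOffsets(Arrays.asList(span, single))) {
      System.out.println(o[0] + "-" + o[1]);
    }
  }
}
```

Highlighting the leaves instead of the enclosing span is what gives the better fidelity on span queries: words that fall between the matching terms are left unhighlighted.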
Next up is handling OffsetsEnum.getTerm(). I could change the API so that
getTerm() returns getQuery() and consequently update Passage & PassageScorer.
Callers of getTerm() were all internal or considered experimental anyway
(definitely not in common use), so I think it could change in a minor release.
I hope multi-term queries will be reported as such, but I fear
MatchesIterator expands them into their matching terms without retaining the
original query, so the results here would be adequate rather than ideal.
Then, OffsetsEnum.freq(). This one is hard. We could make "-1" an unsupported
value. Then, a new PassageScorer design that is created per highlighted field
value could be given access to the IndexReader in
org.apache.lucene.search.uhighlight.FieldHighlighter#highlightOffsetsEnums.
When it sees -1 at scoring time, it could calculate the in-doc freq and cache
it. Or similarly... maybe we don't care that much about the in-doc freq; it
may be expensive to calculate anyway. Maybe we want the associated Query's
score for this document (which will consider global stats like IDF), but that
again requires access to the IndexReader. It'd be nice if boosts wrapped around the
query could be considered but it's just not there (also true without MI mode).
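The "-1 means unsupported, compute and cache on demand" idea for freq() could look roughly like the sketch below. Everything here is hypothetical: FreqResolver is not an existing Lucene class, and the lookup function merely stands in for whatever would consult the IndexReader.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToIntFunction;

public class LazyFreqSketch {

  /** Sentinel an enum would return when it cannot report an in-doc freq. */
  public static final int FREQ_UNSUPPORTED = -1;

  /** Hypothetical helper, created once per highlighted field value. */
  public static final class FreqResolver {
    private final ToIntFunction<String> freqLookup; // assumed to be backed by the IndexReader
    private final Map<String, Integer> cache = new HashMap<>();
    private int lookups = 0;

    public FreqResolver(ToIntFunction<String> freqLookup) {
      this.freqLookup = freqLookup;
    }

    /** Returns the reported freq, or computes and caches it when it is -1. */
    public int resolve(String term, int reportedFreq) {
      if (reportedFreq != FREQ_UNSUPPORTED) {
        return reportedFreq;
      }
      Integer cached = cache.get(term);
      if (cached == null) {
        lookups++; // the potentially expensive path, taken once per term
        cached = freqLookup.applyAsInt(term);
        cache.put(term, cached);
      }
      return cached;
    }

    public int lookupCount() {
      return lookups;
    }
  }

  public static void main(String[] args) {
    Map<String, Integer> fakeIndex = new HashMap<>();
    fakeIndex.put("lucene", 3);
    FreqResolver resolver = new FreqResolver(t -> fakeIndex.getOrDefault(t, 0));

    System.out.println(resolver.resolve("lucene", FREQ_UNSUPPORTED));
    System.out.println(resolver.resolve("lucene", FREQ_UNSUPPORTED)); // served from cache
    System.out.println(resolver.resolve("solr", 7)); // freq was supported, no lookup
    System.out.println("lookups: " + resolver.lookupCount());
  }
}
```

Swapping the lookup for one that returns the Query's score for the document would follow the same pattern; only the cached value type changes.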
> UnifiedHighlighter should support the new Weight.matches API for better match
> accuracy
> --------------------------------------------------------------------------------------
>
> Key: LUCENE-8286
> URL: https://issues.apache.org/jira/browse/LUCENE-8286
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/highlighter
> Reporter: David Smiley
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
>
> The new Weight.matches() API should allow the UnifiedHighlighter to
> highlight some BooleanQuery patterns more accurately -- see LUCENE-7903.
> In addition, this API should make the job of highlighting easier, reducing
> the LOC and related complexities, especially the UH's PhraseHelper. Note:
> reducing/removing PhraseHelper is not a near-term goal since Weight.matches
> is experimental and incomplete, and perhaps we'll discover some gaps in
> flexibility/functionality.
> This issue should introduce a new UnifiedHighlighter.HighlightFlag enum
> option for this method of highlighting. Perhaps call it {{WEIGHT_MATCHES}}?
> Longer term it could go away and it'll be implied if you specify enum values
> for PHRASES & MULTI_TERM_QUERY?