[ https://issues.apache.org/jira/browse/LUCENE-8848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16860110#comment-16860110 ]
David Smiley commented on LUCENE-8848: -------------------------------------- This patch will resolve linked issues relating to both the Surround query parser and ComplexPhrase query parser. > UnifiedHighlighter should highlight all Query types that implement > Weight.matches > --------------------------------------------------------------------------------- > > Key: LUCENE-8848 > URL: https://issues.apache.org/jira/browse/LUCENE-8848 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/highlighter > Reporter: David Smiley > Priority: Major > Attachments: LUCENE-8848.patch > > > The UnifiedHighlighter internally extracts terms and automata from the query. > Usually this works perfectly but it's possible a Query might be of a type it > doesn't know -- a leaf query that is perhaps in effect similar to a > MultiTermQuery yet it might not even be a subclass of this or it does but the > UH doesn't know how to extract an automata from it. The UH is oblivious to > this and probably won't highlight this query. If re-analysis of the text is > necessary, the UH will pre-filter all terms to only those it _thinks_ are > pertinent. Or if offsets are in the postings then the UH could perform very > poorly by unleashing this query on the index for each highlighted document > without recognizing re-analysis is a more appropriate path. > I think to solve this, the UnifiedHighlighter.getFieldHighlighter needs to > inspect the query (using a QueryVisitor) to see if it can find a leaf query > that is not one it knows how to pull automata from, and is otherwise not in a > special list (like MatchAllDocsQuery). If we find one, we avoid choosing > OffsetSource.POSTINGS or OffsetSource.NONE_NEEDED since we might in effect > have an MTQ like query. If a MemoryIndex is needed then we don't pre-filter > the terms since we can't assume we know precisely which terms are pertinent. > We needn't bother extracting terms & automata in this case either; it's > wasted effort which can involve building a CharacterRunAutomaton (see > MultiTermHighlighting.binaryToCharRunAutomaton). Speaking of which, it'd be > nice to avoid that in other cases as well, like for WEIGHT_MATCHES when we > aren't using MemoryIndex (thus no term pre-filtering). -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org