[ https://issues.apache.org/jira/browse/LUCENE-8848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16869876#comment-16869876 ]
ASF subversion and git services commented on LUCENE-8848:
---------------------------------------------------------

Commit 54cc70127b22083198f1c44f83ccf4cdf769ac77 in lucene-solr's branch refs/heads/master from David Wayne Smiley
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=54cc701 ]

LUCENE-8848 LUCENE-7757 LUCENE-8492: UnifiedHighlighter.hasUnrecognizedQuery

The UH now detects that parts of the query are not understood by it.
When found, it highlights more safely/reliably.
Fixes compatibility with complex and surround query parsers.

> UnifiedHighlighter should highlight all Query types that implement
> Weight.matches
> ---------------------------------------------------------------------------------
>
>                 Key: LUCENE-8848
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8848
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/highlighter
>            Reporter: David Smiley
>            Priority: Major
>         Attachments: LUCENE-8848.patch
>
>
> The UnifiedHighlighter internally extracts terms and automata from the query.
> Usually this works perfectly, but it's possible a Query might be of a type it
> doesn't know -- a leaf query that is perhaps in effect similar to a
> MultiTermQuery, yet it might not even be a subclass of it, or it is but the
> UH doesn't know how to extract an automaton from it. The UH is oblivious to
> this and probably won't highlight this query. If re-analysis of the text is
> necessary, the UH will pre-filter all terms to only those it _thinks_ are
> pertinent. Or if offsets are in the postings, then the UH could perform very
> poorly by unleashing this query on the index for each highlighted document
> without recognizing that re-analysis is a more appropriate path.
> I think to solve this, UnifiedHighlighter.getFieldHighlighter needs to
> inspect the query (using a QueryVisitor) to see if it can find a leaf query
> that is not one it knows how to pull automata from, and is otherwise not in a
> special list (like MatchAllDocsQuery). If we find one, we avoid choosing
> OffsetSource.POSTINGS or OffsetSource.NONE_NEEDED since we might in effect
> have an MTQ-like query. If a MemoryIndex is needed, then we don't pre-filter
> the terms since we can't assume we know precisely which terms are pertinent.
> We needn't bother extracting terms & automata in this case either; it's
> wasted effort which can involve building a CharacterRunAutomaton (see
> MultiTermHighlighting.binaryToCharRunAutomaton). Speaking of which, it'd be
> nice to avoid that in other cases as well, like for WEIGHT_MATCHES when we
> aren't using MemoryIndex (thus no term pre-filtering).
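
The decision logic proposed in the issue can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from the commit: the Node, Leaf, and Compound types are hypothetical stand-ins for a Lucene query tree (the real check would walk the query with a QueryVisitor), the RECOGNIZED set is an invented example list, and falling back to ANALYSIS is an assumption, since the issue only says which offset sources to avoid.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class UnrecognizedQuerySketch {

  /** Hypothetical stand-in for a query tree node; NOT Lucene's Query class. */
  interface Node {}

  /** A leaf query, identified here only by its type name. */
  static final class Leaf implements Node {
    final String type;
    Leaf(String type) { this.type = type; }
  }

  /** A compound query (think of a boolean query) holding child clauses. */
  static final class Compound implements Node {
    final List<Node> children;
    Compound(Node... children) { this.children = Arrays.asList(children); }
  }

  /** The two sources named in the issue, plus ANALYSIS as an assumed fallback. */
  enum OffsetSource { POSTINGS, NONE_NEEDED, ANALYSIS }

  // Illustrative list only: leaf types we pretend the highlighter can extract
  // terms or automata from, plus the "special list" entry MatchAllDocsQuery.
  static final Set<String> RECOGNIZED = new HashSet<>(Arrays.asList(
      "TermQuery", "PhraseQuery", "WildcardQuery", "MatchAllDocsQuery"));

  /** Returns true if any leaf in the tree is of a type not in RECOGNIZED. */
  static boolean hasUnrecognizedQuery(Node node) {
    if (node instanceof Compound) {
      for (Node child : ((Compound) node).children) {
        if (hasUnrecognizedQuery(child)) {
          return true;
        }
      }
      return false;
    }
    if (node instanceof Leaf) {
      return !RECOGNIZED.contains(((Leaf) node).type);
    }
    return true; // unknown node kinds are treated as unrecognized
  }

  /** Avoids POSTINGS and NONE_NEEDED when part of the query is not understood. */
  static OffsetSource chooseOffsetSource(OffsetSource preferred, Node query) {
    boolean risky = preferred == OffsetSource.POSTINGS
        || preferred == OffsetSource.NONE_NEEDED;
    return (risky && hasUnrecognizedQuery(query)) ? OffsetSource.ANALYSIS : preferred;
  }

  /** Term pre-filtering is only safe when every leaf is understood. */
  static boolean shouldPreFilterTerms(Node query) {
    return !hasUnrecognizedQuery(query);
  }

  public static void main(String[] args) {
    Node query = new Compound(new Leaf("TermQuery"), new Leaf("SomeCustomQuery"));
    System.out.println(chooseOffsetSource(OffsetSource.POSTINGS, query));
    System.out.println(shouldPreFilterTerms(query));
  }
}
```

Treating any unknown node as unrecognized errs on the side of the slower but safer path, which matches the issue's stated goal of highlighting "more safely/reliably" when the query is not fully understood.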