[ 
https://issues.apache.org/jira/browse/LUCENE-8349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16519770#comment-16519770
 ] 

David Smiley commented on LUCENE-8349:
--------------------------------------

This highlighter is impressive for not a lot of code! Great work [~romseygeek]! 
Some observations:
 * Offsets can be in postings, term vectors, or via analysis. I didn't think 
you'd get to all 3 in your first patch/PR but you did! You could probably 
"borrow" the "TermVectorFilteredLeafReader" mechanism from the UH to address 
the hybrid case.
 * I don't see any special multi-term query (e.g. wildcards) processing but I 
suspect MTQs will _just work_. However, I'm really concerned about the 
performance for when offsets are in postings, since the term explosion could be 
huge and slow to advance. FWIW the UH solves this in two ways. First it visits 
the Query tree looking for MTQs, and if found then switches away from offsets 
in postings to analysis as a better trade-off. Granted it won't _always_ be the 
best trade-off but it addresses the worst case. Also, if a simple term vector 
(no offsets) are present then it can overlays it with the real index using 
TermVectorFilteredLeafReader, thus avoiding analysis.
 * The so-called "requireFieldMatch" option seen elsewhere is mandatory here – 
the query must match with whatever fields it specifies. Perhaps this 
OffsetsReader thing could map all terms() asked for into the field to be 
highlighted?
 * Obviously requires Weight.matches on the underlying query tree to work. 
SpanQueries don't *yet* but I assume that'll be added soon in another issue.
 * I didn't notice any passage scoring/ranking.
 * I didn't notice any javadocs yet but you'll get there I assume.
 * I really like your reuse of a Lucene Document with TextField entries to 
represent a set of passages for the document's highlighted fields!
 * I like that it highlights per-field value as opposed to the PH/UH which 
highlights a joined string across all values, which is awkward for the UH. This 
will allow for better short-circuiting options in the future. *However*, I'm 
concerned about the performance implications from what I see here as I believe 
the highlighting process starts over again on the query for each value, 
effectively creating an O(N^2) problem as the underlying offsets needs to be 
skipped over to get to the right window over and over again.

BTW some complexity in the UH that I don't see here is related to query tree 
visiting, such as for MultiTermQueries and also for getting all the terms 
(granted the latter is easy and not much code). This information is put to good 
use by building a MemoryIndex collecting only the pertinent terms and not 
bothering with the rest.

If this highlighter moves forward, I figure at some point you're going to have 
to address visiting/walking queries (e.g. to look for MTQs) and/or perhaps 
rewriting them. Consider these related issues: LUCENE-8184 LUCENE-8160 
LUCENE-3041

> Highlighter based on Matches API
> --------------------------------
>
>                 Key: LUCENE-8349
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8349
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Alan Woodward
>            Assignee: Alan Woodward
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> I started trying to integrate the Matches API into the UnifiedHighlighter, 
> but there's a fairly heavy impedance mismatch between the way the two of them 
> work (eg Matches doesn't give you freqs, it's entirely lazy, the UH tries to 
> do things by field rather than by doc).  So instead, I thought I'd try and 
> write a new highlighter based around Matches, and see what it looks like.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to