[
https://issues.apache.org/jira/browse/LUCENE-7526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
David Smiley updated LUCENE-7526:
---------------------------------
Attachment: LUCENE-7526.patch
I understand RE MultiValueTokenStream. I was motivated by considering
highlighting a TokenStream without a MemoryIndex resulted in some complexity in
various places (not introduced here, I mean pre-existing); some of it is
addressed with this patch, and others still linger. Another hack that could
move is the getPayoad() returning the Term. If this offset strategy returned a
subclassed OffsetEnum, that quirk could be better isolated. Any way; let that
be for another day.
Attached is the patch delta from master. I'll commit tomorrow night.
> Improvements to UnifiedHighlighter OffsetStrategies
> ---------------------------------------------------
>
> Key: LUCENE-7526
> URL: https://issues.apache.org/jira/browse/LUCENE-7526
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/highlighter
> Reporter: Timothy M. Rodriguez
> Assignee: David Smiley
> Priority: Minor
> Fix For: 6.4
>
> Attachments: LUCENE-7526.patch
>
>
> This ticket improves several of the UnifiedHighlighter FieldOffsetStrategies
> by reducing reliance on creating or re-creating TokenStreams.
> The primary changes are as follows:
> * AnalysisOffsetStrategy - split into two offset strategies
> ** MemoryIndexOffsetStrategy - the primary analysis mode that utilizes a
> MemoryIndex for producing Offsets
> ** TokenStreamOffsetStrategy - an offset strategy that avoids creating a
> MemoryIndex. Can only be used if the query distills down to terms and
> automata.
> * TokenStream removal
> ** MemoryIndexOffsetStrategy - previously a TokenStream was created to fill
> the memory index and then once consumed a new one was generated by
> uninverting the MemoryIndex back into a TokenStream if there were automata
> (wildcard/mtq queries) involved. Now this is avoided, which should save
> memory and avoid a second pass over the data.
> ** TermVectorOffsetStrategy - this was refactored in a similar way to avoid
> generating a TokenStream if automata are involved.
> ** PostingsWithTermVectorsOffsetStrategy - similar refactoring
> * CompositePostingsEnum - aggregates several underlying PostingsEnums for
> wildcard/mtq queries. This should improve relevancy by providing unified
> metrics for a wildcard across all it's term matches
> * Added a HighlightFlag for enabling the newly separated
> TokenStreamOffsetStrategy since it can adversely affect passage relevancy
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]