[jira] [Updated] (LUCENE-7526) Improvements to UnifiedHighlighter OffsetStrategies

Timothy M. Rodriguez (JIRA) Thu, 27 Oct 2016 12:29:24 -0700

     [ https://issues.apache.org/jira/browse/LUCENE-7526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Timothy M. Rodriguez updated LUCENE-7526:
-----------------------------------------
    Description: 
This ticket improves several of the UnifiedHighlighter FieldOffsetStrategies by 
reducing reliance on creating or re-creating TokenStreams.

The primary changes are as follows:

* AnalysisOffsetStrategy - split into two offset strategies
  ** MemoryIndexOffsetStrategy - the primary analysis mode that utilizes a 
MemoryIndex for producing Offsets
  ** TokenStreamOffsetStrategy - an offset strategy that avoids creating a 
MemoryIndex.  Can only be used if the query distills down to terms and automata.

* TokenStream removal 
  ** MemoryIndexOffsetStrategy - previously a TokenStream was created to fill 
the memory index and then once consumed a new one was generated by uninverting 
the MemoryIndex back into a TokenStream if there were automata (wildcard/mtq 
queries) involved.  Now this is avoided, which should save memory and avoid a 
second pass over the data.
  ** TermVectorOffsetStrategy - this was refactored in a similar way to avoid 
generating a TokenStream if automata are involved.
  ** PostingsWithTermVectorsOffsetStrategy - similar refactoring

* CompositePostingsEnum - aggregates several underlying PostingsEnums for 
wildcard/mtq queries.  This should improve relevancy by providing unified 
metrics for a wildcard across all its term matches.

* Added a HighlightFlag for enabling the newly separated 
TokenStreamOffsetStrategy since it can adversely affect passage relevancy
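
As a rough illustration of the split described above, the choice between the two analysis-based strategies might be sketched as follows. This is not the code from the patch: the class, enum, and parameter names are hypothetical, and the fallback to the MemoryIndex-based strategy when the query is more than terms and automata is an assumption drawn from the description.

```java
/** Illustrative sketch only; none of these names come from Lucene. */
final class OffsetStrategyChoiceSketch {

    /** Stand-ins for the two strategies that replace AnalysisOffsetStrategy. */
    enum Strategy { MEMORY_INDEX, TOKEN_STREAM }

    /**
     * @param tokenStreamFlagEnabled  whether the (hypothetically named) highlight flag is set
     * @param onlyTermsAndAutomata    whether the query distills down to terms and automata
     */
    static Strategy choose(boolean tokenStreamFlagEnabled, boolean onlyTermsAndAutomata) {
        // The token-stream strategy is opt-in and only valid for simple queries.
        if (tokenStreamFlagEnabled && onlyTermsAndAutomata) {
            return Strategy.TOKEN_STREAM;
        }
        // Assumed default: the primary, MemoryIndex-based analysis mode.
        return Strategy.MEMORY_INDEX;
    }

    public static void main(String[] args) {
        System.out.println(choose(true, true));
        System.out.println(choose(true, false));
    }
}
```

Keeping the token-stream path opt-in matches the note that it can adversely affect passage relevancy.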

  was:
This ticket improves several of the UnifiedHighlighter FieldOffsetStrategies by 
reducing reliance on creating or re-creating TokenStreams.

The primary changes are as follows:

* AnalysisOffsetStrategy - split into two offset strategies
  * MemoryIndexOffsetStrategy - the primary analysis mode that utilizes a 
MemoryIndex for producing Offsets
  * TokenStreamOffsetStrategy - an offset strategy that avoids creating a 
MemoryIndex.  Can only be used if the query distills down to terms and automata.

* TokenStream removal 
  * MemoryIndexOffsetStrategy - previously a TokenStream was created to fill 
the memory index and then once consumed a new one was generated by uninverting 
the MemoryIndex back into a TokenStream if there were automata (wildcard/mtq 
queries) involved.  Now this is avoided, which should save memory and avoid a 
second pass over the data.
  * TermVectorOffsetStrategy - this was refactored in a similar way to avoid 
generating a TokenStream if automata are involved.
  * PostingsWithTermVectorsOffsetStrategy - similar refactoring

* CompositePostingsEnum - aggregates several underlying PostingsEnums for 
wildcard/mtq queries.  This should improve relevancy by providing unified 
metrics for a wildcard across all its term matches

* Added a HighlightFlag for enabling the newly separated 
TokenStreamOffsetStrategy since it can adversely affect passage relevancy


> Improvements to UnifiedHighlighter OffsetStrategies
> ---------------------------------------------------
>
>                 Key: LUCENE-7526
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7526
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Timothy M. Rodriguez
>            Priority: Minor
>              Labels: highlighter, unified-highlighter
>
> This ticket improves several of the UnifiedHighlighter FieldOffsetStrategies 
> by reducing reliance on creating or re-creating TokenStreams.
> The primary changes are as follows:
> * AnalysisOffsetStrategy - split into two offset strategies
>   ** MemoryIndexOffsetStrategy - the primary analysis mode that utilizes a 
> MemoryIndex for producing Offsets
>   ** TokenStreamOffsetStrategy - an offset strategy that avoids creating a 
> MemoryIndex.  Can only be used if the query distills down to terms and 
> automata.
> * TokenStream removal 
>   ** MemoryIndexOffsetStrategy - previously a TokenStream was created to fill 
> the memory index and then once consumed a new one was generated by 
> uninverting the MemoryIndex back into a TokenStream if there were automata 
> (wildcard/mtq queries) involved.  Now this is avoided, which should save 
> memory and avoid a second pass over the data.
>   ** TermVectorOffsetStrategy - this was refactored in a similar way to avoid 
> generating a TokenStream if automata are involved.
>   ** PostingsWithTermVectorsOffsetStrategy - similar refactoring
> * CompositePostingsEnum - aggregates several underlying PostingsEnums for 
> wildcard/mtq queries.  This should improve relevancy by providing unified 
> metrics for a wildcard across all its term matches.
> * Added a HighlightFlag for enabling the newly separated 
> TokenStreamOffsetStrategy since it can adversely affect passage relevancy
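
The aggregation idea behind CompositePostingsEnum can be sketched without any Lucene dependency. The example below is only an illustration under stated assumptions: `Offset`, `TermCursor`, `merge`, and `totalFreq` are hypothetical names, each cursor stands in for the postings of one term matched by a wildcard, and a priority queue merges the per-term offsets into a single ordered stream while the frequency is summed across terms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Illustrative sketch only; this is not Lucene's CompositePostingsEnum. */
final class CompositeOffsetsSketch {

    /** One match as character offsets [start, end) into the field text. */
    static final class Offset {
        final int start;
        final int end;
        Offset(int start, int end) { this.start = start; this.end = end; }
        @Override public String toString() { return start + "-" + end; }
    }

    /** Hypothetical stand-in for one term's postings, sorted by start offset. */
    static final class TermCursor {
        private final List<Offset> offsets;
        private int next = 0;
        TermCursor(Offset... offsets) { this.offsets = Arrays.asList(offsets); }
        boolean hasNext() { return next < offsets.size(); }
        Offset peek() { return offsets.get(next); }
        Offset advance() { return offsets.get(next++); }
        int freq() { return offsets.size(); }
    }

    /** Merges the per-term cursors into one list ordered by start offset. */
    static List<Offset> merge(List<TermCursor> cursors) {
        PriorityQueue<TermCursor> queue = new PriorityQueue<>(
            Comparator.comparingInt((TermCursor c) -> c.peek().start));
        for (TermCursor c : cursors) {
            if (c.hasNext()) queue.add(c);
        }
        List<Offset> merged = new ArrayList<>();
        while (!queue.isEmpty()) {
            TermCursor top = queue.poll();
            merged.add(top.advance());
            if (top.hasNext()) queue.add(top); // re-insert under its new head offset
        }
        return merged;
    }

    /** Treats the wildcard as one unit: its frequency is the sum over its terms. */
    static int totalFreq(List<TermCursor> cursors) {
        int sum = 0;
        for (TermCursor c : cursors) sum += c.freq();
        return sum;
    }

    public static void main(String[] args) {
        // Terms matched by a hypothetical wildcard such as "high*".
        List<TermCursor> cursors = Arrays.asList(
            new TermCursor(new Offset(0, 9), new Offset(40, 49)), // "highlight"
            new TermCursor(new Offset(20, 31)));                  // "highlighter"
        System.out.println(totalFreq(cursors)); // prints 3
        System.out.println(merge(cursors));     // prints [0-9, 20-31, 40-49]
    }
}
```

Scoring the wildcard on the combined frequency, rather than on each matched term separately, is one way to read the "unified metrics" mentioned in the description.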



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

