David Smiley created LUCENE-6034:
------------------------------------

             Summary: MemoryIndex should be able to wrap TermVector Terms
                 Key: LUCENE-6034
                 URL: https://issues.apache.org/jira/browse/LUCENE-6034
             Project: Lucene - Core
          Issue Type: Improvement
          Components: modules/highlighter
            Reporter: David Smiley
            Assignee: David Smiley
             Fix For: 5.0


The default highlighter has a "WeightedSpanTermExtractor" that uses MemoryIndex 
for certain queries -- basically phrases, SpanQueries, and the like.  For lots 
of text, this aspect of highlighting is time consuming and consumes a fair 
amount of memory.  It also consumes memory by wrapping the TokenStream in 
CachingTokenFilter in this case.  But if the underlying TokenStream actually 
comes from TokenSources (wrapping TermVector Terms), both the re-indexing and 
the caching are needless, since the terms already exist in inverted form.  
Furthermore, MemoryIndex doesn't support payloads.

The patch here has 3 aspects to it:
* Internal refactoring of MemoryIndex to simplify it by maintaining the fields 
in a sorted state using a TreeMap.  This reduced the LOC for this file, even 
with the other features I added.  It also puts the FieldInfo on the Info, and 
thus there's one less data structure to keep around.  I suppose if there is a 
huge variety of fields in MemoryIndex, the O(log N) cost of each field lookup 
could add up, but that seems very unlikely.  I also brought in the 
MemoryIndexNormDocValues as a simple anonymous inner class - it's super-simple 
after all, not worth having in a separate file.
* New MemoryIndex.addField(String fieldName, Terms) method.  In this case, 
MemoryIndex provides the supporting wrappers around the underlying Terms so 
that they appear as an index.  In so doing, MemoryIndex supports payloads for 
such fields.
* WeightedSpanTermExtractor now detects when TokenSources is wrapping Terms and 
passes the underlying Terms to MemoryIndex.
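
To illustrate the first point, here is a minimal plain-Java sketch of keeping 
per-field state in a TreeMap, with the field metadata stored on the per-field 
entry itself.  It is not the patch's code and uses no Lucene classes: the names 
SortedFieldsSketch, Info, getOrCreate, addTokens, and fieldNames are all 
hypothetical, and fieldName/fieldNumber merely stand in for what a FieldInfo 
would carry.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Hypothetical sketch: fields kept sorted by name in a single TreeMap. */
public class SortedFieldsSketch {

    /** Hypothetical per-field entry; it carries its own metadata (stand-in for FieldInfo). */
    public static final class Info {
        public final String fieldName;  // assumption: stands in for FieldInfo.name
        public final int fieldNumber;   // assumption: stands in for FieldInfo.number
        public int numTokens;

        Info(String fieldName, int fieldNumber) {
            this.fieldName = fieldName;
            this.fieldNumber = fieldNumber;
        }
    }

    // One structure serves both lookup by name and iteration in sorted order.
    private final TreeMap<String, Info> fields = new TreeMap<>();

    /** O(log N) lookup; creates the entry on first use. */
    public Info getOrCreate(String fieldName) {
        Info info = fields.get(fieldName);
        if (info == null) {
            info = new Info(fieldName, fields.size());
            fields.put(fieldName, info);
        }
        return info;
    }

    public void addTokens(String fieldName, int count) {
        getOrCreate(fieldName).numTokens += count;
    }

    /** Field names come back already sorted; no separate sort step is needed. */
    public List<String> fieldNames() {
        return new ArrayList<>(fields.keySet());
    }

    public static void main(String[] args) {
        SortedFieldsSketch index = new SortedFieldsSketch();
        index.addTokens("title", 3);
        index.addTokens("body", 10);
        index.addTokens("title", 2);
        System.out.println(index.fieldNames()); // prints [body, title]
    }
}
```

The trade-off is the one noted above: each lookup costs O(log N) instead of a 
hash lookup, in exchange for not maintaining a second sorted structure.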



