[ 
https://issues.apache.org/jira/browse/LUCENE-6065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218334#comment-14218334
 ] 

Michael McCandless commented on LUCENE-6065:
--------------------------------------------

+1

> remove "foreign readers" from merge, fix LeafReader instead.
> ------------------------------------------------------------
>
>                 Key: LUCENE-6065
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6065
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Robert Muir
>
> Currently, SegmentMerger supports two classes of citizens being merged:
> # SegmentReader
> # "foreign reader" (e.g. some FilterReader)
> It does an instanceof check and executes the merge differently. In the 
> SegmentReader case: stored fields and term vectors are bulk-merged, norms and 
> docvalues are transferred directly without piling up on the heap, CRC32 
> verification runs with IO locality of the data being merged, etc. Otherwise, 
> we treat it as a "foreign" reader and it's slow.
> This is just the low level; it gets worse as you wrap with more stuff. A 
> great example is SortingMergePolicy: not only will it have the 
> low-level slowdowns listed above, it will e.g. cache/pile up OrdinalMaps for 
> all string docvalues fields being merged, plus other silliness that just makes 
> matters worse.
> Another use case is 5.0 users wishing to upgrade from fieldcache to 
> docvalues. This should be possible to implement with a simple incremental 
> transition based on a mergepolicy that uses UninvertingReader. But we 
> shouldn't populate internal fieldcache entries unnecessarily on merge and 
> spike RAM until all those segment cores are released, and other things like 
> bulk merge of stored fields and not piling up norms should still work: they 
> are completely unrelated.
> There are more problems we can fix if we clean this up: 
> checkindex/checkreader can run efficiently without the RAM spike that 
> merging causes, we can remove the checkIntegrity() method completely from 
> LeafReader, since it can always be accomplished on producers, etc. In general 
> it would be nice to just have one codepath for merging that is as efficient 
> as we can make it, and to support things like index modifications during 
> merge.
> I spent a few weeks writing 3 different implementations to fix this 
> (interface, optional abstract class, "fix LeafReader"), and the last is the 
> only one I don't completely hate: I think our APIs should be efficient for 
> indexing as well as search.
> So the proposal is simple: refactor LeafReader to just require 
> the producer APIs as abstract methods (and FilterReaders should work on 
> those). The search-oriented APIs can just be final methods that defer to them.
> So we would add 5 abstract methods, but implement 10 current methods as final 
> based on those, and then merging would always be efficient.
> {code}
>   // new abstract codec-based apis
>   /** 
>    * Expert: retrieve thread-private TermVectorsReader
>    * @throws AlreadyClosedException if this reader is closed
>    * @lucene.internal 
>    */
>   protected abstract TermVectorsReader getTermVectorsReader();
>   /** 
>    * Expert: retrieve thread-private StoredFieldsReader
>    * @throws AlreadyClosedException if this reader is closed
>    * @lucene.internal 
>    */
>   protected abstract StoredFieldsReader getFieldsReader();
>   
>   /** 
>    * Expert: retrieve underlying NormsProducer
>    * @throws AlreadyClosedException if this reader is closed
>    * @lucene.internal 
>    */
>   protected abstract NormsProducer getNormsReader();
>   
>   /** 
>    * Expert: retrieve underlying DocValuesProducer
>    * @throws AlreadyClosedException if this reader is closed
>    * @lucene.internal 
>    */
>   protected abstract DocValuesProducer getDocValuesReader();
>   
>   /** 
>    * Expert: retrieve underlying FieldsProducer
>    * @throws AlreadyClosedException if this reader is closed
>    * @lucene.internal  
>    */
>   protected abstract FieldsProducer getPostingsReader();
>   // user/search oriented public apis based on the above
>   public final Fields fields();
>   public final void document(int, StoredFieldVisitor);
>   public final Fields getTermVectors(int);
>   public final NumericDocValues getNumericDocValues(String);
>   public final Bits getDocsWithField(String);
>   public final BinaryDocValues getBinaryDocValues(String);
>   public final SortedDocValues getSortedDocValues(String);
>   public final SortedNumericDocValues getSortedNumericDocValues(String);
>   public final SortedSetDocValues getSortedSetDocValues(String);
>   public final NumericDocValues getNormValues(String);
> {code}
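
To illustrate the proposal quoted above, here is a minimal Java sketch of the "producer methods are abstract, search methods are final" shape. It is not Lucene code: `LeafReaderSketch`, `NormsProducerSketch`, `SegmentReaderSketch`, `DoublingFilterReaderSketch` and `MergeSketch` are hypothetical, heavily simplified stand-ins, and only norms are modeled. The point is that a filter wraps the producer, so the merge loop needs no instanceof check.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a codec-level producer (not the Lucene class).
interface NormsProducerSketch {
    long norm(String field, int docID);
}

// Hypothetical stand-in for the refactored LeafReader.
abstract class LeafReaderSketch {
    public abstract int maxDoc();

    // Producer-oriented API: the only thing subclasses and filters implement.
    protected abstract NormsProducerSketch getNormsReader();

    // Search-oriented API: final, always defers to the producer.
    public final long getNormValue(String field, int docID) {
        return getNormsReader().norm(field, docID);
    }
}

// A plain segment: serves norms straight from its own (in-memory) data.
final class SegmentReaderSketch extends LeafReaderSketch {
    private final int maxDoc;
    private final Map<String, long[]> norms;

    SegmentReaderSketch(int maxDoc, Map<String, long[]> norms) {
        this.maxDoc = maxDoc;
        this.norms = norms;
    }

    @Override public int maxDoc() { return maxDoc; }

    @Override protected NormsProducerSketch getNormsReader() {
        return (field, docID) -> norms.get(field)[docID];
    }
}

// A filter changes values by wrapping the producer, not by overriding search methods.
final class DoublingFilterReaderSketch extends LeafReaderSketch {
    private final LeafReaderSketch in;

    DoublingFilterReaderSketch(LeafReaderSketch in) { this.in = in; }

    @Override public int maxDoc() { return in.maxDoc(); }

    @Override protected NormsProducerSketch getNormsReader() {
        NormsProducerSketch delegate = in.getNormsReader();
        return (field, docID) -> 2 * delegate.norm(field, docID);
    }
}

final class MergeSketch {
    // One code path: every reader, filtered or not, hands over its producer.
    static long[] mergeNorms(String field, List<LeafReaderSketch> readers) {
        int total = 0;
        for (LeafReaderSketch r : readers) {
            total += r.maxDoc();
        }
        long[] merged = new long[total];
        int out = 0;
        for (LeafReaderSketch r : readers) {
            NormsProducerSketch producer = r.getNormsReader();
            for (int doc = 0; doc < r.maxDoc(); doc++) {
                merged[out++] = producer.norm(field, doc);
            }
        }
        return merged;
    }
}

class ProducerSketchDemo {
    public static void main(String[] args) {
        Map<String, long[]> data = new HashMap<>();
        data.put("body", new long[] {3, 5});
        LeafReaderSketch segment = new SegmentReaderSketch(2, data);
        LeafReaderSketch filtered = new DoublingFilterReaderSketch(segment);
        long[] merged = MergeSketch.mergeNorms("body", Arrays.asList(segment, filtered));
        System.out.println(Arrays.toString(merged)); // prints [3, 5, 6, 10]
    }
}
```

In this sketch the search-side call `filtered.getNormValue("body", 1)` and the merge both see the filter's values, because both go through the same producer accessor.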



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
