Re: possible segment merge improvement?

Michael McCandless Fri, 02 Nov 2007 10:24:00 -0800

OK, I got Robert's optimization working on the current trunk ... I
will open a Jira issue with the patch.


Mike

"robert engels" <[EMAIL PROTECTED]> wrote:
> I have looked into modifying FieldInfos to keep the fields sorted by  
> field name, so the user would not be forced to add the fields in the  
> same order.
> 
> Sparse documents are really not a problem: after the first
> merge of such a document, it will pick up the other fields from the
> other segments, after which it will merge "as the same".
> 
> I had to add getFieldInfos() to SegmentReader to make all of this  
> work. I did not need to modify FieldInfos or FieldInfo - I do the  
> equality checks in SegmentMerger, and only perform them once.
> 
> Code looks as follows:
> 
>    private final int mergeFields() throws IOException {
>       fieldInfos = new FieldInfos(); // merge field names
>       int docCount = 0;
>       for (int i = 0; i < readers.size(); i++) {
>           IndexReader reader = (IndexReader) readers.elementAt(i);
>           if (reader instanceof SegmentReader) {
>               SegmentReader sreader = (SegmentReader) reader;
>               for (int j = 0; j < sreader.getFieldInfos().size(); j++) {
>                   FieldInfo fi = sreader.getFieldInfos().fieldInfo(j);
>                   fieldInfos.add(fi.name, fi.isIndexed, fi.storeTermVector,
>                           fi.storePositionWithTermVector, fi.storeOffsetWithTermVector,
>                           !reader.hasNorms(fi.name));
>               }
>           } else {
>               addIndexed(reader, fieldInfos,
>                       reader.getFieldNames(IndexReader.FieldOption.TERMVECTOR_WITH_POSITION_OFFSET),
>                       true, true, true);
>               addIndexed(reader, fieldInfos,
>                       reader.getFieldNames(IndexReader.FieldOption.TERMVECTOR_WITH_POSITION),
>                       true, true, false);
>               addIndexed(reader, fieldInfos,
>                       reader.getFieldNames(IndexReader.FieldOption.TERMVECTOR_WITH_OFFSET),
>                       true, false, true);
>               addIndexed(reader, fieldInfos,
>                       reader.getFieldNames(IndexReader.FieldOption.TERMVECTOR),
>                       true, false, false);
>               addIndexed(reader, fieldInfos,
>                       reader.getFieldNames(IndexReader.FieldOption.INDEXED),
>                       false, false, false);
>               fieldInfos.add(
>                       reader.getFieldNames(IndexReader.FieldOption.UNINDEXED), false);
>           }
>       }
>       fieldInfos.write(directory, segment + ".fnm");
> 
>       SegmentReader[] sreaders = new SegmentReader[readers.size()];
>       for (int i = 0; i < readers.size(); i++) {
>           IndexReader reader = (IndexReader) readers.elementAt(i);
>           boolean same = reader.getFieldNames().size() == fieldInfos.size()
>                   && reader instanceof SegmentReader;
>           if(same) {
>               SegmentReader sreader = (SegmentReader) reader;
>               for (int j = 0; same && j < fieldInfos.size(); j++) {
>                   same = fieldInfos.fieldName(j).equals(
>                           sreader.getFieldInfos().fieldName(j));
>               }
>               if(same)
>                   sreaders[i] = sreader;
>           }
>       }
>       
>       byte[] buffer = new byte[1024];
> 
>       // merge field values
>       FieldsWriter fieldsWriter = new FieldsWriter(directory, segment,
>               fieldInfos);
>       
>       try {
>           for (int i = 0; i < readers.size(); i++) {
>               IndexReader reader = (IndexReader) readers.elementAt(i);
>               SegmentReader sreader = sreaders[i];
>               int maxDoc = reader.maxDoc();
>               for (int j = 0; j < maxDoc; j++)
>                   if (!reader.isDeleted(j)) { // skip deleted docs
>                       if (sreader!=null) {
>                           int len = sreader.length(j);
>                           if (len > buffer.length) {
>                               buffer = new byte[len * 2];
>                           }
>                           sreader.document(buffer, j, len);
>                           fieldsWriter.addDocument(buffer, len);
>                       } else {
>                           fieldsWriter.addDocument(reader.document(j));
>                       }
>                       docCount++;
>                   }
>           }
>       } finally {
>           fieldsWriter.close();
>       }
>       return docCount;
>      }
> 
> 
> On Nov 1, 2007, at 10:47 AM, Yonik Seeley wrote:
> 
> > On 11/1/07, Doron Cohen <[EMAIL PROTECTED]> wrote:
> >> My reading of Robert's suggestion is that when we know that
> >> FieldInfos of the resulted segment is identical to the
> >> FieldInfos of a certain (sub) segment being merged then
> >> there is no need to parse+rewrite the field data for all
> >> docs of that (sub)segment, rather they can be written as is.
> >
> > Ah right... so for sparse fields it really depends on the order
> > documents were added to the segment I imagine.
> > If a document w/o all fields is added first, I guess the field numbers
> > would be different in the segments.  Also, people should take care to
> > add fields in the same order (first doc in the segment will define the
> > fieldname->fieldnumber ordering I think)
> >
> > -Yonik
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> 
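
To make the field-numbering point in the thread above concrete, here is a
minimal sketch in plain Java. It assumes, as Yonik suggests, that field
numbers are handed out in first-seen order. FieldNumberingSketch,
numberFields and canCopyRaw are hypothetical names used only for
illustration; they are not Lucene classes or methods.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a segment's field table: a field's number is
// its index in the returned list. This is NOT Lucene's FieldInfos.
class FieldNumberingSketch {

    // Assign field numbers in the order names are first encountered
    // (assumption: this mirrors how a segment numbers its fields).
    static List<String> numberFields(List<List<String>> docs) {
        List<String> byNumber = new ArrayList<>();
        for (List<String> doc : docs) {
            for (String name : doc) {
                if (!byNumber.contains(name)) {
                    byNumber.add(name);
                }
            }
        }
        return byNumber;
    }

    // Stored fields can be copied as raw bytes only if every field number
    // means the same field name in the source and in the merged segment.
    static boolean canCopyRaw(List<String> segment, List<String> merged) {
        return segment.equals(merged);
    }

    public static void main(String[] args) {
        // Segment A: every document adds fields in the same order.
        List<String> a = numberFields(Arrays.asList(
                Arrays.asList("title", "body"),
                Arrays.asList("title", "body", "author")));
        // Segment B: the first document is sparse (no title).
        List<String> b = numberFields(Arrays.asList(
                Arrays.asList("body"),
                Arrays.asList("title", "body", "author")));
        // Merged numbering: walk the segments' field tables in order.
        List<String> merged = numberFields(Arrays.asList(a, b));

        System.out.println(canCopyRaw(a, merged)); // prints "true"
        System.out.println(canCopyRaw(b, merged)); // prints "false"
    }
}
```

In this sketch segment B's first document lacks "title", so its numbering
differs from the merged one and it would have to take the parse-and-rewrite
path. Keeping the fields sorted by name, as Robert considered, should avoid
that whenever the segments hold the same set of fields.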
