I have looked into modifying FieldInfos to keep the fields sorted by field name, so the user would not be forced to add the fields in the same order.

Sparse documents are not really a problem. After the first merge that includes such a document, the merged segment picks up the other fields from the other segments, and from then on it merges "as the same".
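To illustrate the sparse case, here is a toy model (not Lucene code). It assumes a segment's field numbering can be represented as a plain list of field names in field-number order, and that the merged numbering is the union of names in first-seen order. The class and method names are made up for the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

public class SparseMergeSketch {
    // Merged numbering: union of all field names, in first-seen order across segments.
    static List<String> mergeFieldNames(List<List<String>> segments) {
        LinkedHashSet<String> merged = new LinkedHashSet<>();
        for (List<String> segment : segments) {
            merged.addAll(segment);
        }
        return new ArrayList<>(merged);
    }

    // A segment can be copied "as is" only if its names match the merged
    // names position by position (same count, same order).
    static boolean sameNumbering(List<String> segment, List<String> merged) {
        return segment.equals(merged);
    }

    public static void main(String[] args) {
        List<String> full = Arrays.asList("title", "body");
        List<String> sparse = Arrays.asList("title"); // documents without "body"

        // First merge: the sparse segment does not match and takes the slow path.
        List<String> merged = mergeFieldNames(Arrays.asList(full, sparse));
        System.out.println(sameNumbering(full, merged));   // true
        System.out.println(sameNumbering(sparse, merged)); // false

        // The merged segment now carries both fields, so a later merge
        // with another full segment matches.
        List<String> next = mergeFieldNames(Arrays.asList(merged, full));
        System.out.println(sameNumbering(merged, next));   // true
    }
}
```

If the sparse segment came first and contained a field in a different position, the merged order would change and neither segment would match, which is why sorting by field name would help.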

I had to add getFieldInfos() to SegmentReader to make all of this work. I did not need to modify FieldInfos or FieldInfo: I do the equality checks in SegmentMerger, and only perform them once.

The code looks as follows:

    private final int mergeFields() throws IOException {
        fieldInfos = new FieldInfos(); // merge field names
        int docCount = 0;
        for (int i = 0; i < readers.size(); i++) {
            IndexReader reader = (IndexReader) readers.elementAt(i);
            if (reader instanceof SegmentReader) {
                SegmentReader sreader = (SegmentReader) reader;
                for (int j = 0; j < sreader.getFieldInfos().size(); j++) {
                    FieldInfo fi = sreader.getFieldInfos().fieldInfo(j);
                    fieldInfos.add(fi.name, fi.isIndexed, fi.storeTermVector,
                            fi.storePositionWithTermVector, fi.storeOffsetWithTermVector,
                            !reader.hasNorms(fi.name));
                }
            } else {
                addIndexed(reader, fieldInfos, reader.getFieldNames(IndexReader.FieldOption.TERMVECTOR_WITH_POSITION_OFFSET), true, true, true);
                addIndexed(reader, fieldInfos, reader.getFieldNames(IndexReader.FieldOption.TERMVECTOR_WITH_POSITION), true, true, false);
                addIndexed(reader, fieldInfos, reader.getFieldNames(IndexReader.FieldOption.TERMVECTOR_WITH_OFFSET), true, false, true);
                addIndexed(reader, fieldInfos, reader.getFieldNames(IndexReader.FieldOption.TERMVECTOR), true, false, false);
                addIndexed(reader, fieldInfos, reader.getFieldNames(IndexReader.FieldOption.INDEXED), false, false, false);
                fieldInfos.add(reader.getFieldNames(IndexReader.FieldOption.UNINDEXED), false);
            }
        }
        fieldInfos.write(directory, segment + ".fnm");

        SegmentReader[] sreaders = new SegmentReader[readers.size()];
        for (int i = 0; i < readers.size(); i++) {
            IndexReader reader = (IndexReader) readers.elementAt(i);
            boolean same = reader.getFieldNames().size() == fieldInfos.size() && reader instanceof SegmentReader;
            if(same) {
                SegmentReader sreader = (SegmentReader) reader;
                for (int j = 0; same && j < fieldInfos.size(); j++) {
                    same = fieldInfos.fieldName(j).equals(sreader.getFieldInfos().fieldName(j));
                }
                if(same)
                    sreaders[i] = sreader;
            }
        }
        
        byte[] buffer = new byte[1024];

        // merge field values
        FieldsWriter fieldsWriter = new FieldsWriter(directory, segment, fieldInfos);
        
        try {
            for (int i = 0; i < readers.size(); i++) {
                IndexReader reader = (IndexReader) readers.elementAt(i);
                SegmentReader sreader = sreaders[i];
                int maxDoc = reader.maxDoc();
                for (int j = 0; j < maxDoc; j++)
                    if (!reader.isDeleted(j)) { // skip deleted docs
                        if (sreader!=null) {
                            int len = sreader.length(j);
                            if (len > buffer.length) {
                                buffer = new byte[len * 2];
                            }
                            sreader.document(buffer, j, len);
                            fieldsWriter.addDocument(buffer, len);
                        } else {
                            fieldsWriter.addDocument(reader.document(j));
                        }
                        docCount++;
                    }
            }
        } finally {
            fieldsWriter.close();
        }
        return docCount;
    }
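
The fast path above copies each document's stored bytes through a reusable buffer instead of parsing and rewriting the fields. A self-contained sketch of that pattern follows. It assumes a hypothetical length-prefixed record format in place of Lucene's stored-fields files, with `readInt`/`readFully` standing in for the `sreader.length(j)` and `sreader.document(buffer, j, len)` calls; it is not the actual implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Arrays;

public class RawCopySketch {
    // Copies `count` length-prefixed records from in to out without decoding them.
    static int copyRecords(DataInputStream in, DataOutputStream out, int count)
            throws IOException {
        byte[] buffer = new byte[4]; // deliberately small so the growth path is exercised
        int copied = 0;
        for (int i = 0; i < count; i++) {
            int len = in.readInt();
            if (len > buffer.length) {
                // Over-allocate so slightly larger records do not force another resize.
                buffer = new byte[len * 2];
            }
            in.readFully(buffer, 0, len);
            out.writeInt(len);
            out.write(buffer, 0, len);
            copied++;
        }
        return copied;
    }

    public static void main(String[] args) throws IOException {
        // Build a source "segment" of three records of different sizes.
        ByteArrayOutputStream srcBytes = new ByteArrayOutputStream();
        DataOutputStream src = new DataOutputStream(srcBytes);
        byte[][] records = { {1, 2, 3}, new byte[10], {9} };
        for (byte[] r : records) {
            src.writeInt(r.length);
            src.write(r);
        }

        ByteArrayOutputStream dstBytes = new ByteArrayOutputStream();
        int copied = copyRecords(
                new DataInputStream(new ByteArrayInputStream(srcBytes.toByteArray())),
                new DataOutputStream(dstBytes),
                records.length);

        // The destination is byte-for-byte identical to the source.
        System.out.println(copied + " "
                + Arrays.equals(srcBytes.toByteArray(), dstBytes.toByteArray())); // 3 true
    }
}
```

The buffer is reallocated only when a record exceeds its current size, so most documents are copied with no allocation at all.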


On Nov 1, 2007, at 10:47 AM, Yonik Seeley wrote:

On 11/1/07, Doron Cohen <[EMAIL PROTECTED]> wrote:
My reading of Robert's suggestion is that when we know that
FieldInfos of the resulted segment is identical to the
FieldInfos of a certain (sub) segment being merged then
there is no need to parse+rewrite the field data for all
docs of that (sub)segment, rather they can be written as is.

Ah right... so for sparse fields it really depends on the order
documents were added to the segment I imagine.
If a document w/o all fields is added first, I guess the field numbers
would be different in the segments.  Also, people should take care to
add fields in the same order (first doc in the segment will define the
fieldname->fieldnumber ordering I think)

-Yonik
