OK, I got Robert's optimization working on the current trunk ... I
will open a Jira issue with the patch.
Mike
"robert engels" <[EMAIL PROTECTED]> wrote:
> I have looked into modifying FieldInfos to keep the fields sorted by
> field name, so the user would not be forced to add the fields in the
> same order.
>
> Sparse documents are really not a problem: after the first merge
> of such a document, its segment will pick up the other fields from
> the other segments, and from then on it will merge "as the same".
>
> I had to add getFieldInfos() to SegmentReader to make all of this
> work. I did not need to modify FieldInfos or FieldInfo - I do the
> equality checks in SegmentMerger, and only perform them once.
>
> Code looks as follows:
>
> private final int mergeFields() throws IOException {
>   fieldInfos = new FieldInfos(); // merge field names
>   int docCount = 0;
>   for (int i = 0; i < readers.size(); i++) {
>     IndexReader reader = (IndexReader) readers.elementAt(i);
>     if (reader instanceof SegmentReader) {
>       SegmentReader sreader = (SegmentReader) reader;
>       for (int j = 0; j < sreader.getFieldInfos().size(); j++) {
>         FieldInfo fi = sreader.getFieldInfos().fieldInfo(j);
>         fieldInfos.add(fi.name, fi.isIndexed, fi.storeTermVector,
>             fi.storePositionWithTermVector, fi.storeOffsetWithTermVector,
>             !reader.hasNorms(fi.name));
>       }
>     } else {
>       addIndexed(reader, fieldInfos,
>           reader.getFieldNames(IndexReader.FieldOption.TERMVECTOR_WITH_POSITION_OFFSET),
>           true, true, true);
>       addIndexed(reader, fieldInfos,
>           reader.getFieldNames(IndexReader.FieldOption.TERMVECTOR_WITH_POSITION),
>           true, true, false);
>       addIndexed(reader, fieldInfos,
>           reader.getFieldNames(IndexReader.FieldOption.TERMVECTOR_WITH_OFFSET),
>           true, false, true);
>       addIndexed(reader, fieldInfos,
>           reader.getFieldNames(IndexReader.FieldOption.TERMVECTOR),
>           true, false, false);
>       addIndexed(reader, fieldInfos,
>           reader.getFieldNames(IndexReader.FieldOption.INDEXED),
>           false, false, false);
>       fieldInfos.add(reader.getFieldNames(IndexReader.FieldOption.UNINDEXED), false);
>     }
>   }
>   fieldInfos.write(directory, segment + ".fnm");
>
>   SegmentReader[] sreaders = new SegmentReader[readers.size()];
>   for (int i = 0; i < readers.size(); i++) {
>     IndexReader reader = (IndexReader) readers.elementAt(i);
>     boolean same = reader.getFieldNames().size() == fieldInfos.size()
>         && reader instanceof SegmentReader;
>     if (same) {
>       SegmentReader sreader = (SegmentReader) reader;
>       for (int j = 0; same && j < fieldInfos.size(); j++) {
>         same = fieldInfos.fieldName(j).equals(sreader.getFieldInfos().fieldName(j));
>       }
>       if (same)
>         sreaders[i] = sreader;
>     }
>   }
>
>   byte[] buffer = new byte[1024];
>
>   // merge field values
>   FieldsWriter fieldsWriter = new FieldsWriter(directory, segment, fieldInfos);
>
>   try {
>     for (int i = 0; i < readers.size(); i++) {
>       IndexReader reader = (IndexReader) readers.elementAt(i);
>       SegmentReader sreader = sreaders[i];
>       int maxDoc = reader.maxDoc();
>       for (int j = 0; j < maxDoc; j++)
>         if (!reader.isDeleted(j)) { // skip deleted docs
>           if (sreader != null) {
>             int len = sreader.length(j);
>             if (len > buffer.length) {
>               buffer = new byte[len * 2];
>             }
>             sreader.document(buffer, j, len);
>             fieldsWriter.addDocument(buffer, len);
>           } else {
>             fieldsWriter.addDocument(reader.document(j));
>           }
>           docCount++;
>         }
>     }
>   } finally {
>     fieldsWriter.close();
>   }
>   return docCount;
> }
>
>
> On Nov 1, 2007, at 10:47 AM, Yonik Seeley wrote:
>
> > On 11/1/07, Doron Cohen <[EMAIL PROTECTED]> wrote:
> >> My reading of Robert's suggestion is that when we know that
> >> FieldInfos of the resulted segment is identical to the
> >> FieldInfos of a certain (sub) segment being merged then
> >> there is no need to parse+rewrite the field data for all
> >> docs of that (sub)segment, rather they can be written as is.
> >
> > Ah right... so for sparse fields it really depends on the order
> > documents were added to the segment I imagine.
> > If a document w/o all fields is added first, I guess the field numbers
> > would be different in the segments. Also, people should take care to
> > add fields in the same order (first doc in the segment will define the
> > fieldname->fieldnumber ordering I think)
> >
> > -Yonik
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
>
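For anyone following the thread, here is a rough standalone sketch of the numbering issue Yonik describes and of the "same" check in Robert's code. It does not use Lucene at all: FieldNumberingSketch, numberFields, mergeFieldTables and sameNumbering are hypothetical names made up for illustration, and a plain list of field names (index = field number) stands in for FieldInfos. The only assumption carried over from the thread is that field numbers are handed out in first-seen order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for per-segment field tables; not the Lucene FieldInfos class.
class FieldNumberingSketch {

    // Assigns field numbers in first-seen order: a name's number is its list index.
    static List<String> numberFields(List<List<String>> docs) {
        List<String> byNumber = new ArrayList<>();
        for (List<String> doc : docs) {
            for (String name : doc) {
                if (!byNumber.contains(name)) {
                    byNumber.add(name);
                }
            }
        }
        return byNumber;
    }

    // Builds the merged table by adding each segment's fields in segment order,
    // the way the first loop of mergeFields() walks the readers.
    static List<String> mergeFieldTables(List<List<String>> segments) {
        return numberFields(segments);
    }

    // Mirrors the "same" test: equal size and the same name at every field number.
    static boolean sameNumbering(List<String> segment, List<String> merged) {
        if (segment.size() != merged.size()) {
            return false;
        }
        for (int j = 0; j < merged.size(); j++) {
            if (!merged.get(j).equals(segment.get(j))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Segment A: every document lists its fields in the same order.
        List<String> segA = numberFields(Arrays.asList(
                Arrays.asList("title", "body"),
                Arrays.asList("title", "body", "author")));
        // Segment B: a sparse document comes first, so "body" gets number 0.
        List<String> segB = numberFields(Arrays.asList(
                Arrays.asList("body"),
                Arrays.asList("title", "body", "author")));

        List<String> merged = mergeFieldTables(Arrays.asList(segA, segB));
        System.out.println("merged table:   " + merged);
        System.out.println("segA bulk copy: " + sameNumbering(segA, merged));
        System.out.println("segB bulk copy: " + sameNumbering(segB, merged));
    }
}
```

In this toy run segment A lines up with the merged table and could take the raw-copy path, while segment B (sparse document first) cannot and would fall back to document-by-document rewriting. Once B's documents have been rewritten into the merged segment, that segment carries the merged numbering, which is the "after the first merge" point above.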