Hi developers,

I made some measurements of Lucene's disk usage during index creation. It is no surprise that index creation, and in particular index optimization, temporarily needs more disk space than the final index will occupy. What I did not expect is how large the difference is depending on whether the compound file option is switched on or off. With the compound file option enabled, peak disk usage during index creation is more than three times the final index size. This can become a real pain in projects like Nutch, where huge datasets are indexed. The growth comes from the fact that SegmentMerger creates the complete compound file first and only afterwards deletes the original, now unused files.
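For reference, here is a minimal sketch of one way to capture such peak values (this sampler is not part of the patch and only illustrates the measurement idea; the class name and polling interval are made up):

import java.io.File;

/** Samples the total size of an index directory in the background.
 *  Usage: call start() before indexing, stopAndGetMax() after the IndexWriter is closed. */
public class DiskUsageSampler extends Thread {
    private final File indexDir;            // directory holding the index being built
    private volatile boolean stop = false;
    private long maxBytes = 0;

    public DiskUsageSampler(File indexDir) {
        this.indexDir = indexDir;
        setDaemon(true);
    }

    public void run() {
        while (!stop) {
            long total = 0;
            File[] files = indexDir.listFiles();
            if (files != null) {
                for (int i = 0; i < files.length; i++) {
                    total += files[i].length();
                }
            }
            if (total > maxBytes) {
                maxBytes = total;           // remember the peak
            }
            try { Thread.sleep(50); } catch (InterruptedException e) { /* ignore */ }
        }
    }

    public long stopAndGetMax() {
        stop = true;
        return maxBytes;
    }
}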
So I patched the SegmentMerger and CompoundFileWriter classes so that each source file is deleted immediately after its data has been copied into the compound file. As a result, the peak disk space needed drops from roughly a factor of 3 to roughly a factor of 2 of the final index size.
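To illustrate the idea, here is a simplified sketch of the copy-and-delete loop, using the same org.apache.lucene.store API as the attached classes (the helper class and method names are only for illustration; the real change is in the attached CompoundFileWriter.java):

import java.io.IOException;
import java.util.Iterator;
import java.util.List;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.InputStream;
import org.apache.lucene.store.OutputStream;

final class CopyAndDeleteSketch {
    /** Copies each named file into the compound output stream and deletes it
     *  right away, so the original and its copy never coexist for long. */
    static void copyAll(Directory dir, List fileNames, OutputStream os) throws IOException {
        byte[] buffer = new byte[1024];
        Iterator it = fileNames.iterator();
        while (it.hasNext()) {
            String name = (String) it.next();
            InputStream is = dir.openFile(name);
            try {
                long remainder = is.length();
                while (remainder > 0) {
                    int len = (int) Math.min(buffer.length, remainder);
                    is.readBytes(buffer, 0, len);
                    os.writeBytes(buffer, len);
                    remainder -= len;
                }
            } finally {
                is.close();
            }
            dir.deleteFile(name);   // the key change: free the disk space immediately after copying
        }
    }
}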
The change also requires some modifications to the TestCompoundFile class. Several test methods compared an original file to its counterpart inside the compound file; with the modified SegmentMerger and CompoundFileWriter, the original file has already been deleted at that point and can no longer be opened, so the tests now open the expected streams before the compound file is written.
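The pattern is the same in every affected test. For example, testSingleFile in the attached TestCompoundFile.java now opens the expected stream before the writer runs (excerpt from the attached file, with comments added):

String name = "t" + data[i];
createSequenceFile(dir, name, (byte) 0, data[i]);

// open the expected stream while the original file still exists
InputStream expected = dir.openFile(name);

CompoundFileWriter csw = new CompoundFileWriter(dir, name + ".cfs");
csw.addFile(name);
csw.close();   // the patched writer deletes the original file here

CompoundFileReader csr = new CompoundFileReader(dir, name + ".cfs");
InputStream actual = csr.openFile(name);
assertSameStreams(name, expected, actual);
assertSameSeekBehavior(name, expected, actual);
expected.close();
actual.close();
csr.close();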
Here are some statistics about disk usage during index creation:
compound option off:
    final index size:    380 KB    max. disk space used:    408 KB
    final index size:  11079 KB    max. disk space used:  11381 KB
    final index size: 204148 KB    max. disk space used:  20739 KB

using the compound option:
    final index size:    380 KB    max. disk space used:   1145 KB
    final index size:  11079 KB    max. disk space used:  33544 KB
    final index size: 204148 KB    max. disk space used: 614977 KB

using the compound option with the patch:
    final index size:    380 KB    max. disk space used:    777 KB
    final index size:  11079 KB    max. disk space used:  22464 KB
    final index size: 204148 KB    max. disk space used: 410829 KB
The change was tested under Windows and Linux without any negative side effects, and all JUnit test cases pass. In the attachment you will find all the necessary files:
SegmentMerger.java
CompoundFileWriter.java
TestCompoundFile.java

SegmentMerger.diff
CompoundFileWriter.diff
TestCompoundFile.diff
keep moving
Bernhard
Index: src/java/org/apache/lucene/index/CompoundFileWriter.java
===================================================================
RCS file: /home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/index/CompoundFileWriter.java,v
retrieving revision 1.3
diff -r1.3 CompoundFileWriter.java
163a164,166
>
>         // immediatly delete the copied file to safe disk-space
>         directory.deleteFile((String) fe.file);
package org.apache.lucene.index;
/** * Copyright 2004 The Apache Software Foundation * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. * You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. */ import org.apache.lucene.store.Directory; import org.apache.lucene.store.OutputStream; import org.apache.lucene.store.InputStream; import java.util.LinkedList; import java.util.HashSet; import java.util.Iterator; import java.io.IOException; /** * Combines multiple files into a single compound file. * The file format:<br> * <ul> * <li>VInt fileCount</li> * <li>{Directory} * fileCount entries with the following structure:</li> * <ul> * <li>long dataOffset</li> * <li>UTFString extension</li> * </ul> * <li>{File Data} * fileCount entries with the raw data of the corresponding file</li> * </ul> * * The fileCount integer indicates how many files are contained in this compound * file. The {directory} that follows has that many entries. Each directory entry * contains an encoding identifier, an long pointer to the start of this file's * data section, and a UTF String with that file's extension. * * @author Dmitry Serebrennikov * @version $Id: CompoundFileWriter.java,v 1.3 2004/03/29 22:48:02 cutting Exp $ */ final class CompoundFileWriter { private static final class FileEntry { /** source file */ String file; /** temporary holder for the start of directory entry for this file */ long directoryOffset; /** temporary holder for the start of this file's data section */ long dataOffset; } private Directory directory; private String fileName; private HashSet ids; private LinkedList entries; private boolean merged = false; /** Create the compound stream in the specified file. The file name is the * entire name (no extensions are added). */ public CompoundFileWriter(Directory dir, String name) { if (dir == null) throw new IllegalArgumentException("Missing directory"); if (name == null) throw new IllegalArgumentException("Missing name"); directory = dir; fileName = name; ids = new HashSet(); entries = new LinkedList(); } /** Returns the directory of the compound file. */ public Directory getDirectory() { return directory; } /** Returns the name of the compound file. */ public String getName() { return fileName; } /** Add a source stream. If sourceDir is null, it is set to the * same value as the directory where this compound stream exists. * The id is the string by which the sub-stream will be know in the * compound stream. The caller must ensure that the ID is unique. If the * id is null, it is set to the name of the source file. */ public void addFile(String file) { if (merged) throw new IllegalStateException( "Can't add extensions after merge has been called"); if (file == null) throw new IllegalArgumentException( "Missing source file"); if (! ids.add(file)) throw new IllegalArgumentException( "File " + file + " already added"); FileEntry entry = new FileEntry(); entry.file = file; entries.add(entry); } /** Merge files with the extensions added up to now. * All files with these extensions are combined sequentially into the * compound stream. After successful merge, the source files * are deleted. 
*/ public void close() throws IOException { if (merged) throw new IllegalStateException( "Merge already performed"); if (entries.isEmpty()) throw new IllegalStateException( "No entries to merge have been defined"); merged = true; // open the compound stream OutputStream os = null; try { os = directory.createFile(fileName); // Write the number of entries os.writeVInt(entries.size()); // Write the directory with all offsets at 0. // Remember the positions of directory entries so that we can // adjust the offsets later Iterator it = entries.iterator(); while(it.hasNext()) { FileEntry fe = (FileEntry) it.next(); fe.directoryOffset = os.getFilePointer(); os.writeLong(0); // for now os.writeString(fe.file); } // Open the files and copy their data into the stream. // Remeber the locations of each file's data section. byte buffer[] = new byte[1024]; it = entries.iterator(); while(it.hasNext()) { FileEntry fe = (FileEntry) it.next(); fe.dataOffset = os.getFilePointer(); copyFile(fe, os, buffer); // immediatly delete the copied file to safe disk-space directory.deleteFile((String) fe.file); } // Write the data offsets into the directory of the compound stream it = entries.iterator(); while(it.hasNext()) { FileEntry fe = (FileEntry) it.next(); os.seek(fe.directoryOffset); os.writeLong(fe.dataOffset); } // Close the output stream. Set the os to null before trying to // close so that if an exception occurs during the close, the // finally clause below will not attempt to close the stream // the second time. OutputStream tmp = os; os = null; tmp.close(); } finally { if (os != null) try { os.close(); } catch (IOException e) { } } } /** Copy the contents of the file with specified extension into the * provided output stream. Use the provided buffer for moving data * to reduce memory allocation. */ private void copyFile(FileEntry source, OutputStream os, byte buffer[]) throws IOException { InputStream is = null; try { long startPtr = os.getFilePointer(); is = directory.openFile(source.file); long length = is.length(); long remainder = length; int chunk = buffer.length; while(remainder > 0) { int len = (int) Math.min(chunk, remainder); is.readBytes(buffer, 0, len); os.writeBytes(buffer, len); remainder -= len; } // Verify that remainder is 0 if (remainder != 0) throw new IOException( "Non-zero remainder length after copying: " + remainder + " (id: " + source.file + ", length: " + length + ", buffer size: " + chunk + ")"); // Verify that the output length diff is equal to original file long endPtr = os.getFilePointer(); long diff = endPtr - startPtr; if (diff != length) throw new IOException( "Difference in the output file offsets " + diff + " does not match the original file length " + length); } finally { if (is != null) is.close(); } } }
Index: src/java/org/apache/lucene/index/SegmentMerger.java
===================================================================
RCS file: /home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/index/SegmentMerger.java,v
retrieving revision 1.11
diff -r1.11 SegmentMerger.java
151c151
<     // Perform the merge
---
>     // Perform the merge. Files will be deleted within CompoundFileWriter.close()
153,158c153
<
<     // Now delete the source files
<     it = files.iterator();
<     while (it.hasNext()) {
<       directory.deleteFile((String) it.next());
<     }
---
>
package org.apache.lucene.index; /** * Copyright 2004 The Apache Software Foundation * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. * You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. */ import java.util.Vector; import java.util.ArrayList; import java.util.Iterator; import java.io.IOException; import org.apache.lucene.store.Directory; import org.apache.lucene.store.OutputStream; import org.apache.lucene.store.RAMOutputStream; /** * The SegmentMerger class combines two or more Segments, represented by an IndexReader ([EMAIL PROTECTED] #add}, * into a single Segment. After adding the appropriate readers, call the merge method to combine the * segments. *<P> * If the compoundFile flag is set, then the segments will be merged into a compound file. * * * @see #merge * @see #add */ final class SegmentMerger { private boolean useCompoundFile; private Directory directory; private String segment; private Vector readers = new Vector(); private FieldInfos fieldInfos; // File extensions of old-style index files private static final String COMPOUND_EXTENSIONS[] = new String[] { "fnm", "frq", "prx", "fdx", "fdt", "tii", "tis" }; private static final String VECTOR_EXTENSIONS[] = new String[] { "tvx", "tvd", "tvf" }; /** * * @param dir The Directory to merge the other segments into * @param name The name of the new segment * @param compoundFile true if the new segment should use a compoundFile */ SegmentMerger(Directory dir, String name, boolean compoundFile) { directory = dir; segment = name; useCompoundFile = compoundFile; } /** * Add an IndexReader to the collection of readers that are to be merged * @param reader */ final void add(IndexReader reader) { readers.addElement(reader); } /** * * @param i The index of the reader to return * @return The ith reader to be merged */ final IndexReader segmentReader(int i) { return (IndexReader) readers.elementAt(i); } /** * Merges the readers specified by the [EMAIL PROTECTED] #add} method into the directory passed to the constructor * @return The number of documents that were merged * @throws IOException */ final int merge() throws IOException { int value; value = mergeFields(); mergeTerms(); mergeNorms(); if (fieldInfos.hasVectors()) mergeVectors(); if (useCompoundFile) createCompoundFile(); return value; } /** * close all IndexReaders that have been added. * Should not be called before merge(). * @throws IOException */ final void closeReaders() throws IOException { for (int i = 0; i < readers.size(); i++) { // close readers IndexReader reader = (IndexReader) readers.elementAt(i); reader.close(); } } private final void createCompoundFile() throws IOException { CompoundFileWriter cfsWriter = new CompoundFileWriter(directory, segment + ".cfs"); ArrayList files = new ArrayList(COMPOUND_EXTENSIONS.length + fieldInfos.size()); // Basic files for (int i = 0; i < COMPOUND_EXTENSIONS.length; i++) { files.add(segment + "." 
+ COMPOUND_EXTENSIONS[i]); } // Field norm files for (int i = 0; i < fieldInfos.size(); i++) { FieldInfo fi = fieldInfos.fieldInfo(i); if (fi.isIndexed) { files.add(segment + ".f" + i); } } // Vector files if (fieldInfos.hasVectors()) { for (int i = 0; i < VECTOR_EXTENSIONS.length; i++) { files.add(segment + "." + VECTOR_EXTENSIONS[i]); } } // Now merge all added files Iterator it = files.iterator(); while (it.hasNext()) { cfsWriter.addFile((String) it.next()); } // Perform the merge. Files will be deleted within CompoundFileWriter.close() cfsWriter.close(); } /** * * @return The number of documents in all of the readers * @throws IOException */ private final int mergeFields() throws IOException { fieldInfos = new FieldInfos(); // merge field names int docCount = 0; for (int i = 0; i < readers.size(); i++) { IndexReader reader = (IndexReader) readers.elementAt(i); fieldInfos.addIndexed(reader.getIndexedFieldNames(true), true); fieldInfos.addIndexed(reader.getIndexedFieldNames(false), false); fieldInfos.add(reader.getFieldNames(false), false); } fieldInfos.write(directory, segment + ".fnm"); FieldsWriter fieldsWriter = // merge field values new FieldsWriter(directory, segment, fieldInfos); try { for (int i = 0; i < readers.size(); i++) { IndexReader reader = (IndexReader) readers.elementAt(i); int maxDoc = reader.maxDoc(); for (int j = 0; j < maxDoc; j++) if (!reader.isDeleted(j)) { // skip deleted docs fieldsWriter.addDocument(reader.document(j)); docCount++; } } } finally { fieldsWriter.close(); } return docCount; } /** * Merge the TermVectors from each of the segments into the new one. * @throws IOException */ private final void mergeVectors() throws IOException { TermVectorsWriter termVectorsWriter = new TermVectorsWriter(directory, segment, fieldInfos); try { for (int r = 0; r < readers.size(); r++) { IndexReader reader = (IndexReader) readers.elementAt(r); int maxDoc = reader.maxDoc(); for (int docNum = 0; docNum < maxDoc; docNum++) { // skip deleted docs if (reader.isDeleted(docNum)) { continue; } termVectorsWriter.openDocument(); // get all term vectors TermFreqVector[] sourceTermVector = reader.getTermFreqVectors(docNum); if (sourceTermVector != null) { for (int f = 0; f < sourceTermVector.length; f++) { // translate field numbers TermFreqVector termVector = sourceTermVector[f]; termVectorsWriter.openField(termVector.getField()); String [] terms = termVector.getTerms(); int [] freqs = termVector.getTermFrequencies(); for (int t = 0; t < terms.length; t++) { termVectorsWriter.addTerm(terms[t], freqs[t]); } } termVectorsWriter.closeDocument(); } } } } finally { termVectorsWriter.close(); } } private OutputStream freqOutput = null; private OutputStream proxOutput = null; private TermInfosWriter termInfosWriter = null; private int skipInterval; private SegmentMergeQueue queue = null; private final void mergeTerms() throws IOException { try { freqOutput = directory.createFile(segment + ".frq"); proxOutput = directory.createFile(segment + ".prx"); termInfosWriter = new TermInfosWriter(directory, segment, fieldInfos); skipInterval = termInfosWriter.skipInterval; queue = new SegmentMergeQueue(readers.size()); mergeTermInfos(); } finally { if (freqOutput != null) freqOutput.close(); if (proxOutput != null) proxOutput.close(); if (termInfosWriter != null) termInfosWriter.close(); if (queue != null) queue.close(); } } private final void mergeTermInfos() throws IOException { int base = 0; for (int i = 0; i < readers.size(); i++) { IndexReader reader = (IndexReader) readers.elementAt(i); 
TermEnum termEnum = reader.terms(); SegmentMergeInfo smi = new SegmentMergeInfo(base, termEnum, reader); base += reader.numDocs(); if (smi.next()) queue.put(smi); // initialize queue else smi.close(); } SegmentMergeInfo[] match = new SegmentMergeInfo[readers.size()]; while (queue.size() > 0) { int matchSize = 0; // pop matching terms match[matchSize++] = (SegmentMergeInfo) queue.pop(); Term term = match[0].term; SegmentMergeInfo top = (SegmentMergeInfo) queue.top(); while (top != null && term.compareTo(top.term) == 0) { match[matchSize++] = (SegmentMergeInfo) queue.pop(); top = (SegmentMergeInfo) queue.top(); } mergeTermInfo(match, matchSize); // add new TermInfo while (matchSize > 0) { SegmentMergeInfo smi = match[--matchSize]; if (smi.next()) queue.put(smi); // restore queue else smi.close(); // done with a segment } } } private final TermInfo termInfo = new TermInfo(); // minimize consing /** Merge one term found in one or more segments. The array <code>smis</code> * contains segments that are positioned at the same term. <code>N</code> * is the number of cells in the array actually occupied. * * @param smis array of segments * @param n number of cells in the array actually occupied */ private final void mergeTermInfo(SegmentMergeInfo[] smis, int n) throws IOException { long freqPointer = freqOutput.getFilePointer(); long proxPointer = proxOutput.getFilePointer(); int df = appendPostings(smis, n); // append posting data long skipPointer = writeSkip(); if (df > 0) { // add an entry to the dictionary with pointers to prox and freq files termInfo.set(df, freqPointer, proxPointer, (int) (skipPointer - freqPointer)); termInfosWriter.add(smis[0].term, termInfo); } } /** Process postings from multiple segments all positioned on the * same term. Writes out merged entries into freqOutput and * the proxOutput streams. 
* * @param smis array of segments * @param n number of cells in the array actually occupied * @return number of documents across all segments where this term was found */ private final int appendPostings(SegmentMergeInfo[] smis, int n) throws IOException { int lastDoc = 0; int df = 0; // number of docs w/ term resetSkip(); for (int i = 0; i < n; i++) { SegmentMergeInfo smi = smis[i]; TermPositions postings = smi.postings; int base = smi.base; int[] docMap = smi.docMap; postings.seek(smi.termEnum); while (postings.next()) { int doc = postings.doc(); if (docMap != null) doc = docMap[doc]; // map around deletions doc += base; // convert to merged space if (doc < lastDoc) throw new IllegalStateException("docs out of order"); df++; if ((df % skipInterval) == 0) { bufferSkip(lastDoc); } int docCode = (doc - lastDoc) << 1; // use low bit to flag freq=1 lastDoc = doc; int freq = postings.freq(); if (freq == 1) { freqOutput.writeVInt(docCode | 1); // write doc & freq=1 } else { freqOutput.writeVInt(docCode); // write doc freqOutput.writeVInt(freq); // write frequency in doc } int lastPosition = 0; // write position deltas for (int j = 0; j < freq; j++) { int position = postings.nextPosition(); proxOutput.writeVInt(position - lastPosition); lastPosition = position; } } } return df; } private RAMOutputStream skipBuffer = new RAMOutputStream(); private int lastSkipDoc; private long lastSkipFreqPointer; private long lastSkipProxPointer; private void resetSkip() throws IOException { skipBuffer.reset(); lastSkipDoc = 0; lastSkipFreqPointer = freqOutput.getFilePointer(); lastSkipProxPointer = proxOutput.getFilePointer(); } private void bufferSkip(int doc) throws IOException { long freqPointer = freqOutput.getFilePointer(); long proxPointer = proxOutput.getFilePointer(); skipBuffer.writeVInt(doc - lastSkipDoc); skipBuffer.writeVInt((int) (freqPointer - lastSkipFreqPointer)); skipBuffer.writeVInt((int) (proxPointer - lastSkipProxPointer)); lastSkipDoc = doc; lastSkipFreqPointer = freqPointer; lastSkipProxPointer = proxPointer; } private long writeSkip() throws IOException { long skipPointer = freqOutput.getFilePointer(); skipBuffer.writeTo(freqOutput); return skipPointer; } private void mergeNorms() throws IOException { for (int i = 0; i < fieldInfos.size(); i++) { FieldInfo fi = fieldInfos.fieldInfo(i); if (fi.isIndexed) { OutputStream output = directory.createFile(segment + ".f" + i); try { for (int j = 0; j < readers.size(); j++) { IndexReader reader = (IndexReader) readers.elementAt(j); byte[] input = reader.norms(fi.name); int maxDoc = reader.maxDoc(); for (int k = 0; k < maxDoc; k++) { byte norm = input != null ? input[k] : (byte) 0; if (!reader.isDeleted(k)) { output.writeByte(norm); } } } } finally { output.close(); } } } } }
Index: src/test/org/apache/lucene/index/TestCompoundFile.java
===================================================================
RCS file: /home/cvspublic/jakarta-lucene/src/test/org/apache/lucene/index/TestCompoundFile.java,v
retrieving revision 1.5
diff -r1.5 TestCompoundFile.java
20a21,24
> import java.util.Collection;
> import java.util.HashMap;
> import java.util.Iterator;
> import java.util.Map;
197a202,204
>
>             InputStream expected = dir.openFile(name);
>
203c210
<             InputStream expected = dir.openFile(name);
---
>
206a214
>
220a229,231
>         InputStream expected1 = dir.openFile("d1");
>         InputStream expected2 = dir.openFile("d2");
>
227c238
<         InputStream expected = dir.openFile("d1");
---
>
229,231c240,242
<         assertSameStreams("d1", expected, actual);
<         assertSameSeekBehavior("d1", expected, actual);
<         expected.close();
---
>         assertSameStreams("d1", expected1, actual);
>         assertSameSeekBehavior("d1", expected1, actual);
>         expected1.close();
234c245
<         expected = dir.openFile("d2");
---
>
236,238c247,249
<         assertSameStreams("d2", expected, actual);
<         assertSameSeekBehavior("d2", expected, actual);
<         expected.close();
---
>         assertSameStreams("d2", expected2, actual);
>         assertSameSeekBehavior("d2", expected2, actual);
>         expected2.close();
270,271d280
<         // Now test
<         CompoundFileWriter csw = new CompoundFileWriter(dir, "test.cfs");
275a285,292
>
>         InputStream[] check = new InputStream[data.length];
>         for (int i=0; i<data.length; i++) {
>             check[i] = dir.openFile(segment + data[i]);
>         }
>
>         // Now test
>         CompoundFileWriter csw = new CompoundFileWriter(dir, "test.cfs");
283d299
<             InputStream check = dir.openFile(segment + data[i]);
285,286c301,302
<             assertSameStreams(data[i], check, test);
<             assertSameSeekBehavior(data[i], check, test);
---
>             assertSameStreams(data[i], check[i], test);
>             assertSameSeekBehavior(data[i], check[i], test);
288c304
<             check.close();
---
>             check[i].close();
299c315,316
<     private void setUp_2() throws IOException {
---
>     private Map setUp_2() throws IOException {
>         Map streams = new HashMap(20);
303a321,322
>
>             streams.put("f" + i, dir.openFile("f" + i));
305a325,326
>
>         return streams;
308c329,336
<
---
>     private void closeUp(Map streams) throws IOException {
>         Iterator it = streams.values().iterator();
>         while (it.hasNext()) {
>             InputStream stream = (InputStream)it.next();
>             stream.close();
>         }
>     }
>
364c392
<         setUp_2();
---
>         Map streams = setUp_2();
368c396
<         InputStream expected = dir.openFile("f11");
---
>         InputStream expected = (InputStream)streams.get("f11");
410c438,439
<         expected.close();
---
>         closeUp(streams);
>
418c447
<         setUp_2();
---
>         Map streams = setUp_2();
422,423c451,452
<         InputStream e1 = dir.openFile("f11");
<         InputStream e2 = dir.openFile("f3");
---
>         InputStream e1 = (InputStream)streams.get("f11");
>         InputStream e2 = (InputStream)streams.get("f3");
426c455
<         InputStream a2 = dir.openFile("f3");
---
>         InputStream a2 = cr.openFile("f3");
486,487d514
<         e1.close();
<         e2.close();
490a518,519
>
>         closeUp(streams);
497c526
<         setUp_2();
---
>         Map streams = setUp_2();
569a599,600
>
>         closeUp(streams);
574c605
<         setUp_2();
---
>         Map streams = setUp_2();
587a619,620
>
>         closeUp(streams);
592c625
<         setUp_2();
---
>         Map streams = setUp_2();
617a651,652
>
>         closeUp(streams);
package org.apache.lucene.index; /** * Copyright 2004 The Apache Software Foundation * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. * You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. */ import java.io.IOException; import java.io.File; import java.util.Collection; import java.util.HashMap; import java.util.Iterator; import java.util.Map; import junit.framework.TestCase; import junit.framework.TestSuite; import junit.textui.TestRunner; import org.apache.lucene.store.OutputStream; import org.apache.lucene.store.Directory; import org.apache.lucene.store.InputStream; import org.apache.lucene.store.FSDirectory; import org.apache.lucene.store.RAMDirectory; import org.apache.lucene.store._TestHelper; /** * @author [EMAIL PROTECTED] * @version $Id: TestCompoundFile.java,v 1.5 2004/03/29 22:48:06 cutting Exp $ */ public class TestCompoundFile extends TestCase { /** Main for running test case by itself. */ public static void main(String args[]) { TestRunner.run (new TestSuite(TestCompoundFile.class)); // TestRunner.run (new TestCompoundFile("testSingleFile")); // TestRunner.run (new TestCompoundFile("testTwoFiles")); // TestRunner.run (new TestCompoundFile("testRandomFiles")); // TestRunner.run (new TestCompoundFile("testClonedStreamsClosing")); // TestRunner.run (new TestCompoundFile("testReadAfterClose")); // TestRunner.run (new TestCompoundFile("testRandomAccess")); // TestRunner.run (new TestCompoundFile("testRandomAccessClones")); // TestRunner.run (new TestCompoundFile("testFileNotFound")); // TestRunner.run (new TestCompoundFile("testReadPastEOF")); // TestRunner.run (new TestCompoundFile("testIWCreate")); } private Directory dir; public void setUp() throws IOException { //dir = new RAMDirectory(); dir = FSDirectory.getDirectory(new File(System.getProperty("tempDir"), "testIndex"), true); } /** Creates a file of the specified size with random data. */ private void createRandomFile(Directory dir, String name, int size) throws IOException { OutputStream os = dir.createFile(name); for (int i=0; i<size; i++) { byte b = (byte) (Math.random() * 256); os.writeByte(b); } os.close(); } /** Creates a file of the specified size with sequential data. The first * byte is written as the start byte provided. All subsequent bytes are * computed as start + offset where offset is the number of the byte. 
*/ private void createSequenceFile(Directory dir, String name, byte start, int size) throws IOException { OutputStream os = dir.createFile(name); for (int i=0; i < size; i++) { os.writeByte(start); start ++; } os.close(); } private void assertSameStreams(String msg, InputStream expected, InputStream test) throws IOException { assertNotNull(msg + " null expected", expected); assertNotNull(msg + " null test", test); assertEquals(msg + " length", expected.length(), test.length()); assertEquals(msg + " position", expected.getFilePointer(), test.getFilePointer()); byte expectedBuffer[] = new byte[512]; byte testBuffer[] = new byte[expectedBuffer.length]; long remainder = expected.length() - expected.getFilePointer(); while(remainder > 0) { int readLen = (int) Math.min(remainder, expectedBuffer.length); expected.readBytes(expectedBuffer, 0, readLen); test.readBytes(testBuffer, 0, readLen); assertEqualArrays(msg + ", remainder " + remainder, expectedBuffer, testBuffer, 0, readLen); remainder -= readLen; } } private void assertSameStreams(String msg, InputStream expected, InputStream actual, long seekTo) throws IOException { if(seekTo >= 0 && seekTo < expected.length()) { expected.seek(seekTo); actual.seek(seekTo); assertSameStreams(msg + ", seek(mid)", expected, actual); } } private void assertSameSeekBehavior(String msg, InputStream expected, InputStream actual) throws IOException { // seek to 0 long point = 0; assertSameStreams(msg + ", seek(0)", expected, actual, point); // seek to middle point = expected.length() / 2l; assertSameStreams(msg + ", seek(mid)", expected, actual, point); // seek to end - 2 point = expected.length() - 2; assertSameStreams(msg + ", seek(end-2)", expected, actual, point); // seek to end - 1 point = expected.length() - 1; assertSameStreams(msg + ", seek(end-1)", expected, actual, point); // seek to the end point = expected.length(); assertSameStreams(msg + ", seek(end)", expected, actual, point); // seek past end point = expected.length() + 1; assertSameStreams(msg + ", seek(end+1)", expected, actual, point); } private void assertEqualArrays(String msg, byte[] expected, byte[] test, int start, int len) { assertNotNull(msg + " null expected", expected); assertNotNull(msg + " null test", test); for (int i=start; i<len; i++) { assertEquals(msg + " " + i, expected[i], test[i]); } } // =========================================================== // Tests of the basic CompoundFile functionality // =========================================================== /** This test creates compound file based on a single file. * Files of different sizes are tested: 0, 1, 10, 100 bytes. */ public void testSingleFile() throws IOException { int data[] = new int[] { 0, 1, 10, 100 }; for (int i=0; i<data.length; i++) { String name = "t" + data[i]; createSequenceFile(dir, name, (byte) 0, data[i]); InputStream expected = dir.openFile(name); CompoundFileWriter csw = new CompoundFileWriter(dir, name + ".cfs"); csw.addFile(name); csw.close(); CompoundFileReader csr = new CompoundFileReader(dir, name + ".cfs"); InputStream actual = csr.openFile(name); assertSameStreams(name, expected, actual); assertSameSeekBehavior(name, expected, actual); expected.close(); actual.close(); csr.close(); } } /** This test creates compound file based on two files. 
* */ public void testTwoFiles() throws IOException { createSequenceFile(dir, "d1", (byte) 0, 15); createSequenceFile(dir, "d2", (byte) 0, 114); InputStream expected1 = dir.openFile("d1"); InputStream expected2 = dir.openFile("d2"); CompoundFileWriter csw = new CompoundFileWriter(dir, "d.csf"); csw.addFile("d1"); csw.addFile("d2"); csw.close(); CompoundFileReader csr = new CompoundFileReader(dir, "d.csf"); InputStream actual = csr.openFile("d1"); assertSameStreams("d1", expected1, actual); assertSameSeekBehavior("d1", expected1, actual); expected1.close(); actual.close(); actual = csr.openFile("d2"); assertSameStreams("d2", expected2, actual); assertSameSeekBehavior("d2", expected2, actual); expected2.close(); actual.close(); csr.close(); } /** This test creates a compound file based on a large number of files of * various length. The file content is generated randomly. The sizes range * from 0 to 1Mb. Some of the sizes are selected to test the buffering * logic in the file reading code. For this the chunk variable is set to * the length of the buffer used internally by the compound file logic. */ public void testRandomFiles() throws IOException { // Setup the test segment String segment = "test"; int chunk = 1024; // internal buffer size used by the stream createRandomFile(dir, segment + ".zero", 0); createRandomFile(dir, segment + ".one", 1); createRandomFile(dir, segment + ".ten", 10); createRandomFile(dir, segment + ".hundred", 100); createRandomFile(dir, segment + ".big1", chunk); createRandomFile(dir, segment + ".big2", chunk - 1); createRandomFile(dir, segment + ".big3", chunk + 1); createRandomFile(dir, segment + ".big4", 3 * chunk); createRandomFile(dir, segment + ".big5", 3 * chunk - 1); createRandomFile(dir, segment + ".big6", 3 * chunk + 1); createRandomFile(dir, segment + ".big7", 1000 * chunk); // Setup extraneous files createRandomFile(dir, "onetwothree", 100); createRandomFile(dir, segment + ".notIn", 50); createRandomFile(dir, segment + ".notIn2", 51); final String data[] = new String[] { ".zero", ".one", ".ten", ".hundred", ".big1", ".big2", ".big3", ".big4", ".big5", ".big6", ".big7" }; InputStream[] check = new InputStream[data.length]; for (int i=0; i<data.length; i++) { check[i] = dir.openFile(segment + data[i]); } // Now test CompoundFileWriter csw = new CompoundFileWriter(dir, "test.cfs"); for (int i=0; i<data.length; i++) { csw.addFile(segment + data[i]); } csw.close(); CompoundFileReader csr = new CompoundFileReader(dir, "test.cfs"); for (int i=0; i<data.length; i++) { InputStream test = csr.openFile(segment + data[i]); assertSameStreams(data[i], check[i], test); assertSameSeekBehavior(data[i], check[i], test); test.close(); check[i].close(); } csr.close(); } /** Setup a larger compound file with a number of components, each of * which is a sequential file (so that we can easily tell that we are * reading in the right byte). The methods sets up 20 files - f0 to f19, * the size of each file is 1000 bytes. 
*/ private Map setUp_2() throws IOException { Map streams = new HashMap(20); CompoundFileWriter cw = new CompoundFileWriter(dir, "f.comp"); for (int i=0; i<20; i++) { createSequenceFile(dir, "f" + i, (byte) 0, 2000); cw.addFile("f" + i); streams.put("f" + i, dir.openFile("f" + i)); } cw.close(); return streams; } private void closeUp(Map streams) throws IOException { Iterator it = streams.values().iterator(); while (it.hasNext()) { InputStream stream = (InputStream)it.next(); stream.close(); } } public void testReadAfterClose() throws IOException { demo_FSInputStreamBug((FSDirectory) dir, "test"); } private void demo_FSInputStreamBug(FSDirectory fsdir, String file) throws IOException { // Setup the test file - we need more than 1024 bytes OutputStream os = fsdir.createFile(file); for(int i=0; i<2000; i++) { os.writeByte((byte) i); } os.close(); InputStream in = fsdir.openFile(file); // This read primes the buffer in InputStream byte b = in.readByte(); // Close the file in.close(); // ERROR: this call should fail, but succeeds because the buffer // is still filled b = in.readByte(); // ERROR: this call should fail, but succeeds for some reason as well in.seek(1099); try { // OK: this call correctly fails. We are now past the 1024 internal // buffer, so an actual IO is attempted, which fails b = in.readByte(); } catch (IOException e) { } } static boolean isCSInputStream(InputStream is) { return is instanceof CompoundFileReader.CSInputStream; } static boolean isCSInputStreamOpen(InputStream is) throws IOException { if (isCSInputStream(is)) { CompoundFileReader.CSInputStream cis = (CompoundFileReader.CSInputStream) is; return _TestHelper.isFSInputStreamOpen(cis.base); } else { return false; } } public void testClonedStreamsClosing() throws IOException { Map streams = setUp_2(); CompoundFileReader cr = new CompoundFileReader(dir, "f.comp"); // basic clone InputStream expected = (InputStream)streams.get("f11"); assertTrue(_TestHelper.isFSInputStreamOpen(expected)); InputStream one = cr.openFile("f11"); assertTrue(isCSInputStreamOpen(one)); InputStream two = (InputStream) one.clone(); assertTrue(isCSInputStreamOpen(two)); assertSameStreams("basic clone one", expected, one); expected.seek(0); assertSameStreams("basic clone two", expected, two); // Now close the first stream one.close(); assertTrue("Only close when cr is closed", isCSInputStreamOpen(one)); // The following should really fail since we couldn't expect to // access a file once close has been called on it (regardless of // buffering and/or clone magic) expected.seek(0); two.seek(0); assertSameStreams("basic clone two/2", expected, two); // Now close the compound reader cr.close(); assertFalse("Now closed one", isCSInputStreamOpen(one)); assertFalse("Now closed two", isCSInputStreamOpen(two)); // The following may also fail since the compound stream is closed expected.seek(0); two.seek(0); //assertSameStreams("basic clone two/3", expected, two); // Now close the second clone two.close(); expected.seek(0); two.seek(0); //assertSameStreams("basic clone two/4", expected, two); closeUp(streams); } /** This test opens two files from a compound stream and verifies that * their file positions are independent of each other. 
*/ public void testRandomAccess() throws IOException { Map streams = setUp_2(); CompoundFileReader cr = new CompoundFileReader(dir, "f.comp"); // Open two files InputStream e1 = (InputStream)streams.get("f11"); InputStream e2 = (InputStream)streams.get("f3"); InputStream a1 = cr.openFile("f11"); InputStream a2 = cr.openFile("f3"); // Seek the first pair e1.seek(100); a1.seek(100); assertEquals(100, e1.getFilePointer()); assertEquals(100, a1.getFilePointer()); byte be1 = e1.readByte(); byte ba1 = a1.readByte(); assertEquals(be1, ba1); // Now seek the second pair e2.seek(1027); a2.seek(1027); assertEquals(1027, e2.getFilePointer()); assertEquals(1027, a2.getFilePointer()); byte be2 = e2.readByte(); byte ba2 = a2.readByte(); assertEquals(be2, ba2); // Now make sure the first one didn't move assertEquals(101, e1.getFilePointer()); assertEquals(101, a1.getFilePointer()); be1 = e1.readByte(); ba1 = a1.readByte(); assertEquals(be1, ba1); // Now more the first one again, past the buffer length e1.seek(1910); a1.seek(1910); assertEquals(1910, e1.getFilePointer()); assertEquals(1910, a1.getFilePointer()); be1 = e1.readByte(); ba1 = a1.readByte(); assertEquals(be1, ba1); // Now make sure the second set didn't move assertEquals(1028, e2.getFilePointer()); assertEquals(1028, a2.getFilePointer()); be2 = e2.readByte(); ba2 = a2.readByte(); assertEquals(be2, ba2); // Move the second set back, again cross the buffer size e2.seek(17); a2.seek(17); assertEquals(17, e2.getFilePointer()); assertEquals(17, a2.getFilePointer()); be2 = e2.readByte(); ba2 = a2.readByte(); assertEquals(be2, ba2); // Finally, make sure the first set didn't move // Now make sure the first one didn't move assertEquals(1911, e1.getFilePointer()); assertEquals(1911, a1.getFilePointer()); be1 = e1.readByte(); ba1 = a1.readByte(); assertEquals(be1, ba1); a1.close(); a2.close(); cr.close(); closeUp(streams); } /** This test opens two files from a compound stream and verifies that * their file positions are independent of each other. 
*/ public void testRandomAccessClones() throws IOException { Map streams = setUp_2(); CompoundFileReader cr = new CompoundFileReader(dir, "f.comp"); // Open two files InputStream e1 = cr.openFile("f11"); InputStream e2 = cr.openFile("f3"); InputStream a1 = (InputStream) e1.clone(); InputStream a2 = (InputStream) e2.clone(); // Seek the first pair e1.seek(100); a1.seek(100); assertEquals(100, e1.getFilePointer()); assertEquals(100, a1.getFilePointer()); byte be1 = e1.readByte(); byte ba1 = a1.readByte(); assertEquals(be1, ba1); // Now seek the second pair e2.seek(1027); a2.seek(1027); assertEquals(1027, e2.getFilePointer()); assertEquals(1027, a2.getFilePointer()); byte be2 = e2.readByte(); byte ba2 = a2.readByte(); assertEquals(be2, ba2); // Now make sure the first one didn't move assertEquals(101, e1.getFilePointer()); assertEquals(101, a1.getFilePointer()); be1 = e1.readByte(); ba1 = a1.readByte(); assertEquals(be1, ba1); // Now more the first one again, past the buffer length e1.seek(1910); a1.seek(1910); assertEquals(1910, e1.getFilePointer()); assertEquals(1910, a1.getFilePointer()); be1 = e1.readByte(); ba1 = a1.readByte(); assertEquals(be1, ba1); // Now make sure the second set didn't move assertEquals(1028, e2.getFilePointer()); assertEquals(1028, a2.getFilePointer()); be2 = e2.readByte(); ba2 = a2.readByte(); assertEquals(be2, ba2); // Move the second set back, again cross the buffer size e2.seek(17); a2.seek(17); assertEquals(17, e2.getFilePointer()); assertEquals(17, a2.getFilePointer()); be2 = e2.readByte(); ba2 = a2.readByte(); assertEquals(be2, ba2); // Finally, make sure the first set didn't move // Now make sure the first one didn't move assertEquals(1911, e1.getFilePointer()); assertEquals(1911, a1.getFilePointer()); be1 = e1.readByte(); ba1 = a1.readByte(); assertEquals(be1, ba1); e1.close(); e2.close(); a1.close(); a2.close(); cr.close(); closeUp(streams); } public void testFileNotFound() throws IOException { Map streams = setUp_2(); CompoundFileReader cr = new CompoundFileReader(dir, "f.comp"); // Open two files try { InputStream e1 = cr.openFile("bogus"); fail("File not found"); } catch (IOException e) { /* success */ //System.out.println("SUCCESS: File Not Found: " + e); } cr.close(); closeUp(streams); } public void testReadPastEOF() throws IOException { Map streams = setUp_2(); CompoundFileReader cr = new CompoundFileReader(dir, "f.comp"); InputStream is = cr.openFile("f2"); is.seek(is.length() - 10); byte b[] = new byte[100]; is.readBytes(b, 0, 10); try { byte test = is.readByte(); fail("Single byte read past end of file"); } catch (IOException e) { /* success */ //System.out.println("SUCCESS: single byte read past end of file: " + e); } is.seek(is.length() - 10); try { is.readBytes(b, 0, 50); fail("Block read past end of file"); } catch (IOException e) { /* success */ //System.out.println("SUCCESS: block read past end of file: " + e); } is.close(); cr.close(); closeUp(streams); } }