[ http://issues.apache.org/jira/browse/LUCENE-624?page=all ]

Michael Busch updated LUCENE-624:
---------------------------------

    Attachment: cfs_seg_size_limit.patch

I have attached the patch file for this improvement.

This patch adds two new methods to the API of IndexWriter and IndexModifier:
  /** Get the current value of the compound file segment size limit.
   *  Note that this just returns the value you set with
   *  setCompoundFileSegmentSizeLimit(int) or the default. You cannot
   *  use this to query the status of an existing index.
   *  @see #setCompoundFileSegmentSizeLimit(int)
   */
  public int getCompoundFileSegmentSizeLimit();
    
  /** Sets the maximum number of documents a segment can have
   *  and still be written in compound format. A high
   *  limit will decrease the number of files per index, whereas
   *  a lower limit will improve search performance but 
   *  increase the number of files.
   */
  public void setCompoundFileSegmentSizeLimit(int value);

Furthermore I added a constant to IndexWriter:
  public final static int DEFAULT_COMPOUND_FILE_SEGMENT_SIZE_LIMIT = Integer.MAX_VALUE;

Since the default value is set to Integer.MAX_VALUE, the behavior of 
IndexWriter/IndexModifier only changes if the user uses 
setCompoundFileSegmentSizeLimit(int) to change the value explicitly. 
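
As a minimal sketch of how such a limit could be applied, the class below models
only the decision "compound format or not" for a segment of a given size. It is
a hypothetical stand-in, not the code from the patch and not Lucene's
IndexWriter: the class name, the useCompoundFile(int) helper, the rejection of
negative values, and the "docCount <= limit" rule (taken from the wording of the
issue description) are all assumptions.

```java
// Hypothetical stand-in for the setting described above; it does not use Lucene.
final class CompoundFilePolicySketch {
    // Same default as the constant named in the patch description.
    static final int DEFAULT_COMPOUND_FILE_SEGMENT_SIZE_LIMIT = Integer.MAX_VALUE;

    private int limit = DEFAULT_COMPOUND_FILE_SEGMENT_SIZE_LIMIT;

    int getCompoundFileSegmentSizeLimit() {
        return limit;
    }

    void setCompoundFileSegmentSizeLimit(int value) {
        // Assumption: a negative limit is meaningless, so reject it.
        if (value < 0) {
            throw new IllegalArgumentException("limit must be >= 0");
        }
        limit = value;
    }

    // Assumed rule: a segment is written in compound format only if it
    // contains no more documents than the limit.
    boolean useCompoundFile(int docCount) {
        return docCount <= limit;
    }

    public static void main(String[] args) {
        CompoundFilePolicySketch policy = new CompoundFilePolicySketch();
        // With the default limit, every segment stays in compound format.
        System.out.println(policy.useCompoundFile(1_000_000));
        policy.setCompoundFileSegmentSizeLimit(100_000);
        // Now only segments with at most 100,000 documents use compound format.
        System.out.println(policy.useCompoundFile(100_000));
        System.out.println(policy.useCompoundFile(1_000_000));
    }
}
```

With the default of Integer.MAX_VALUE the check is always true, which matches the
statement that behavior only changes once the limit is set explicitly.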

> Segment size limit for compound files
> -------------------------------------
>
>                 Key: LUCENE-624
>                 URL: http://issues.apache.org/jira/browse/LUCENE-624
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>            Priority: Minor
>         Attachments: cfs_seg_size_limit.patch
>
>
> Hello everyone,
> I implemented an improvement targeting compound file usage. Compound files 
> are used to decrease the number of index files, because operating systems 
> can't handle too many open file descriptors. On the other hand, a 
> disadvantage of the compound file format is worse search performance 
> than with multi-file indexes:
> http://www.gossamer-threads.com/lists/lucene/java-user/8950
> The book "Lucene in Action" states that the compound file format is about 
> 5-10% slower than the multi-file format.
> The patch I'm proposing here lets the IndexWriter use the compound format 
> only for segments that contain no more documents than a user-settable 
> limit, "CompoundFileSegmentSizeLimit".
> Due to the exponential merges, a Lucene index usually contains only a few 
> very big segments, but many more small segments. The best performance is 
> really only needed for the big segments, whereas slightly worse 
> performance for small segments shouldn't play a big role in the overall 
> search performance.
> Consider the following example:
> Index size:                1,500,000
> Merge factor:              10
> Max buffered docs:         100
> Number of indexed fields:  10
> Max. OS file descriptors:  1024
> In the worst case, an unoptimized index could contain the following number of 
> segments:
> 1 x 1,000,000
> 9 x   100,000
> 9 x    10,000
> 9 x     1,000
> 9 x       100
> That's 37 segments. A multi-file format index would have:
> 37 segments * (7 files per segment + 10 files for indexed fields) = 629 files 
> ==> only about 2 open indexes per machine could be handled by the operating 
> system
> A compound-file format index would have:
> 37 segments * 1 cfs file = 37 files ==> about 27 open indexes could be 
> handled by the operating system, but performance would be 5-10% worse.
> A compound-file format index with CompoundFileSegmentSizeLimit = 1,000,000 
> would have:
> 36 segments * 1 cfs file + 1 segment * (7 + 10 files) = 53 ==> about 20 open 
> indexes could be handled by the OS
> The OS can now handle about 20 instead of just 2 open indexes, while keeping 
> multi-file format performance for the largest segment.
> I'm going to create diffs against the current HEAD and will attach the patch 
> files soon. Please let me know what you think about this improvement.
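
The file-count arithmetic in the quoted example can be reproduced with a few
lines of Java. This is only a sketch under the example's own assumptions (7 core
files per segment plus one file per indexed field, 1024 file descriptors); the
class and method names are made up for illustration. Note that integer division
rounds down, so the program reports 1, 27 and 19 open indexes, where the text
above rounds to "about 2", "about 27" and "about 20".

```java
// Hypothetical helper that redoes the arithmetic of the example; not Lucene code.
final class FileCountEstimate {
    static final int CORE_FILES_PER_SEGMENT = 7;   // figure used in the example
    static final int INDEXED_FIELDS = 10;          // one extra file per indexed field
    static final int MAX_FILE_DESCRIPTORS = 1024;

    // Each compound segment is one .cfs file; each multi-file segment
    // contributes its core files plus one file per indexed field.
    static int filesPerIndex(int compoundSegments, int multiFileSegments) {
        int filesPerMultiFileSegment = CORE_FILES_PER_SEGMENT + INDEXED_FIELDS;
        return compoundSegments + multiFileSegments * filesPerMultiFileSegment;
    }

    // Number of whole indexes that fit into the descriptor limit (rounded down).
    static int openIndexes(int filesPerIndex) {
        return MAX_FILE_DESCRIPTORS / filesPerIndex;
    }

    public static void main(String[] args) {
        int segments = 1 + 9 + 9 + 9 + 9;                       // 37 segments
        int multiFile = filesPerIndex(0, segments);             // 629 files
        int compound = filesPerIndex(segments, 0);              // 37 files
        int mixed = filesPerIndex(segments - 1, 1);             // 53 files
        System.out.println("multi-file: " + multiFile + " files, " + openIndexes(multiFile) + " indexes");
        System.out.println("compound:   " + compound + " files, " + openIndexes(compound) + " indexes");
        System.out.println("mixed:      " + mixed + " files, " + openIndexes(mixed) + " indexes");
    }
}
```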


        
