[ http://issues.apache.org/jira/browse/LUCENE-624?page=all ]
Michael Busch updated LUCENE-624:
---------------------------------

    Attachment: cfs_seg_size_limit.patch

I have attached the patch file for this improvement. The patch adds two new methods to the API of IndexWriter and IndexModifier:

  /** Gets the current value of the compound file segment size limit.
   *  Note that this just returns the value you set with
   *  setCompoundFileSegmentSizeLimit(int) or the default. You cannot
   *  use this to query the status of an existing index.
   *  @see #setCompoundFileSegmentSizeLimit(int)
   */
  public int getCompoundFileSegmentSizeLimit();

  /** Sets the limit on the number of documents a segment can have
   *  for the compound format to be used for that segment. A high
   *  limit will decrease the number of files per index, whereas
   *  a lower limit will improve search performance but
   *  increase the number of files.
   */
  public void setCompoundFileSegmentSizeLimit(int value);

Furthermore, I added a constant to IndexWriter:

  public final static int DEFAULT_COMPOUND_FILE_SEGMENT_SIZE_LIMIT = Integer.MAX_VALUE;

Since the default value is Integer.MAX_VALUE, the behavior of IndexWriter/IndexModifier changes only if the user explicitly calls setCompoundFileSegmentSizeLimit(int) to change the value.

> Segment size limit for compound files
> -------------------------------------
>
>                 Key: LUCENE-624
>                 URL: http://issues.apache.org/jira/browse/LUCENE-624
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>            Priority: Minor
>         Attachments: cfs_seg_size_limit.patch
>
>
> Hello everyone,
> I implemented an improvement targeting compound file usage. Compound files
> are used to decrease the number of index files, because operating systems
> can't handle too many open file descriptors.
> On the other hand, a
> disadvantage of the compound file format is its worse performance compared to
> multi-file indexes:
> http://www.gossamer-threads.com/lists/lucene/java-user/8950
> In the book "Lucene in Action" it's said that compound file format is about
> 5-10% slower than multi-file format.
> The patch I'm proposing here adds the ability to the IndexWriter to use
> compound format only for segments that do not contain more documents than a
> specific limit "CompoundFileSegmentSizeLimit", which the user can set.
> Due to the exponential merges, a Lucene index usually contains only a few
> very big segments, but many more small segments. The best performance is
> actually just needed for the big segments, whereas a slightly worse
> performance for small segments shouldn't play a big role in the overall
> search performance.
> Consider the following example:
>   Index Size: 1,500,000
>   Merge factor: 10
>   Max buffered docs: 100
>   Number of indexed fields: 10
>   Max. OS file descriptors: 1024
> In the worst case, a non-optimized index could contain the following number of
> segments:
>   1 x 1,000,000
>   9 x   100,000
>   9 x    10,000
>   9 x     1,000
>   9 x       100
> That's 37 segments. A multi-file format index would have:
>   37 segments * (7 files per segment + 10 files for indexed fields) = 629 files
>   ==> only about 2 open indexes per machine could be handled by the operating
>   system
> A compound-file format index would have:
>   37 segments * 1 cfs file = 37 files ==> about 27 open indexes could be
>   handled by the operating system, but performance would be 5-10% worse.
> A compound-file format index with CompoundFileSegmentSizeLimit = 1,000,000
> would have:
>   36 segments * 1 cfs file + 1 segment * (7 + 10 files) = 53 ==> about 20 open
>   indexes could be handled by the OS
> The OS can now handle 20 instead of just 2 open indexes, while maintaining
> the multi-file format performance.
> I'm going to create diffs on the current HEAD and will attach the patch files
> soon.
> Please let me know what you think about this improvement.
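
The file-count arithmetic in the example can be checked with a small standalone sketch. It does not use Lucene or the attached patch; the class CompoundFileEstimate and its methods are hypothetical names introduced only for illustration. It assumes the inclusive rule stated in the description (a segment is compound if it does not contain more documents than the limit). Under that rule, the 1,000,000-document segment only stays multi-file if the limit is below 1,000,000, so the sketch uses 999,999 to reproduce the 53-file figure.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; not part of Lucene or of the attached patch. */
final class CompoundFileEstimate {

    /** 7 files per segment + 10 files for indexed fields, as in the example. */
    static final int FILES_PER_MULTI_FILE_SEGMENT = 7 + 10;

    /**
     * Hypothetical decision rule (assumption): a segment is written in
     * compound format if it does not contain more documents than the limit.
     */
    static boolean usesCompoundFormat(int docCount, int limit) {
        return docCount <= limit;
    }

    /** Counts index files: 1 .cfs file per compound segment, 17 files otherwise. */
    static int countFiles(List<Integer> segmentDocCounts, int limit) {
        int files = 0;
        for (int docCount : segmentDocCounts) {
            files += usesCompoundFormat(docCount, limit) ? 1 : FILES_PER_MULTI_FILE_SEGMENT;
        }
        return files;
    }

    /** Worst-case layout from the example: 1 x 1,000,000 plus 9 each of 100,000 down to 100. */
    static List<Integer> worstCaseSegments() {
        List<Integer> segments = new ArrayList<>();
        segments.add(1_000_000);
        for (int size = 100_000; size >= 100; size /= 10) {
            for (int i = 0; i < 9; i++) {
                segments.add(size);
            }
        }
        return segments;
    }

    public static void main(String[] args) {
        List<Integer> segments = worstCaseSegments();
        // Limit 0: no segment qualifies for compound format, so all are multi-file.
        System.out.println("multi-file only: " + countFiles(segments, 0));
        // Integer.MAX_VALUE (the patch default): every segment is compound.
        System.out.println("compound only:   " + countFiles(segments, Integer.MAX_VALUE));
        // Limit just below 1,000,000: only the largest segment is multi-file.
        System.out.println("mixed:           " + countFiles(segments, 999_999));
    }
}
```

Run as written, this prints 629, 37 and 53 files for the three cases. Dividing the 1024 file descriptors by those counts gives roughly 1.6, 27.7 and 19.3 open indexes, which is where the rounded figures of about 2, 27 and 20 in the example come from.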