[ http://issues.apache.org/jira/browse/LUCENE-624?page=all ]
Michael Busch closed LUCENE-624.
--------------------------------

    Resolution: Won't Fix
      Assignee: Michael Busch

I'm closing this issue because:
- no votes or comments for almost half a year
- only indexing performance benefits slightly from this feature
- another config parameter in IndexWriter would probably confuse users more
  than help them

> Segment size limit for compound files
> -------------------------------------
>
>                 Key: LUCENE-624
>                 URL: http://issues.apache.org/jira/browse/LUCENE-624
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>         Assigned To: Michael Busch
>            Priority: Minor
>         Attachments: cfs_seg_size_limit.patch
>
> Hello everyone,
>
> I implemented an improvement targeting compound file usage. Compound files
> are used to decrease the number of index files, because operating systems
> can't handle too many open file descriptors. On the other hand, a
> disadvantage of the compound file format is its worse performance compared
> to multi-file indexes:
> http://www.gossamer-threads.com/lists/lucene/java-user/8950
> According to the book "Lucene in Action", the compound file format is
> about 5-10% slower than the multi-file format.
>
> The patch I'm proposing here adds the ability to IndexWriter to use the
> compound format only for segments that do not contain more documents than
> a user-settable limit, "CompoundFileSegmentSizeLimit" (see the usage
> sketch after the quoted message).
>
> Due to exponential merging, a Lucene index usually contains only a few
> very big segments but many more small segments. Top performance is really
> only needed for the big segments, whereas slightly worse performance for
> the small segments shouldn't play a big role in overall search
> performance.
>
> Consider the following example:
> Index size: ~2,000,000 documents
> Merge factor: 10
> Max buffered docs: 100
> Number of indexed fields: 10
> Max. OS file descriptors: 1024
>
> In the worst case, an unoptimized index could contain the following
> segments:
> 1 x 1,000,000
> 9 x 100,000
> 9 x 10,000
> 9 x 1,000
> 9 x 100
> That's 37 segments.
>
> A multi-file format index would have:
> 37 segments * (7 files per segment + 10 files for indexed fields) = 629 files
> ==> the operating system could handle only a single open index (a second
> one would exceed the descriptor limit).
>
> A compound-file format index would have:
> 37 segments * 1 cfs file = 37 files
> ==> about 27 open indexes could be handled by the operating system, but
> performance would be 5-10% worse.
>
> A compound-file format index with CompoundFileSegmentSizeLimit = 1,000,000
> would have:
> 36 segments * 1 cfs file + 1 segment * (7 + 10 files) = 53 files
> ==> about 20 open indexes could be handled by the OS (the arithmetic is
> worked through in the second sketch after the quoted message).
>
> The OS can now handle about 20 open indexes instead of just one, while
> search performance stays close to that of the multi-file format, because
> only the small segments use the compound format.
>
> I'm going to create diffs against the current HEAD and will attach the
> patch files soon. Please let me know what you think about this
> improvement.
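Below is a minimal sketch of how the proposed setting might be used. The
setter name setCompoundFileSegmentSizeLimit is an assumption derived from
the parameter name above (the patch was never committed, so this method
does not exist in any released Lucene version); the other calls
(FSDirectory.getDirectory, setMergeFactor, setMaxBufferedDocs,
setUseCompoundFile) are real IndexWriter/FSDirectory API of that era, and
the index path is a placeholder.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;

    public class CompoundLimitSketch {
        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter(
                FSDirectory.getDirectory("/path/to/index", true),
                new StandardAnalyzer(), true);
            writer.setMergeFactor(10);       // as in the example above
            writer.setMaxBufferedDocs(100);  // as in the example above
            writer.setUseCompoundFile(true); // existing setting: build .cfs files

            // Hypothetical setter from the attached patch: segments with
            // more than 1,000,000 documents keep the multi-file format.
            writer.setCompoundFileSegmentSizeLimit(1000000);

            // ... add documents here ...
            writer.close();
        }
    }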
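The descriptor arithmetic in the example can be checked mechanically; all
constants below are taken from the message (7 core files and 10 per-field
files per segment, 37 worst-case segments, 1024 descriptors). Integer
division yields 1, 27, and 19 open indexes, matching the figures above.

    public class FileCountSketch {
        public static void main(String[] args) {
            int segments = 1 + 9 + 9 + 9 + 9;  // 37 worst-case segments
            int filesPerSegment = 7 + 10;      // 7 core + 10 per-field files
            int descriptorLimit = 1024;        // max OS file descriptors

            int multiFile = segments * filesPerSegment;   // 37 * 17 = 629
            int compound = segments;                      // one .cfs each = 37
            int mixed = (segments - 1) + filesPerSegment; // 36 + 17 = 53

            System.out.println("multi-file: " + multiFile + " files, "
                + descriptorLimit / multiFile + " open index(es)");
            System.out.println("compound:   " + compound + " files, "
                + descriptorLimit / compound + " open indexes");
            System.out.println("with limit: " + mixed + " files, "
                + descriptorLimit / mixed + " open indexes");
        }
    }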