Segment size limit for compound files
-------------------------------------

         Key: LUCENE-624
         URL: http://issues.apache.org/jira/browse/LUCENE-624
     Project: Lucene - Java
        Type: Improvement

  Components: Index  
    Reporter: Michael Busch
    Priority: Minor


Hello everyone,

I have implemented an improvement targeting compound file usage. Compound files 
are used to reduce the number of index files, because operating systems can 
only handle a limited number of open file descriptors. The disadvantage of the 
compound file format is its worse search performance compared to multi-file 
indexes:

http://www.gossamer-threads.com/lists/lucene/java-user/8950

In the book "Lucene in Action" it is stated that the compound file format is 
about 5-10% slower than the multi-file format.


The patch I'm proposing here adds the ability for the IndexWriter to use the 
compound format only for segments that contain fewer documents than a 
user-settable limit, "CompoundFileSegmentSizeLimit".
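
To illustrate, here is a minimal sketch of how the new setting might be used 
from application code. The setter name setCompoundFileSegmentSizeLimit(int) is 
just my reading of the proposed parameter; the actual name in the patch may 
differ:

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriter;

  public class CompoundLimitExample {
    public static void main(String[] args) throws Exception {
      IndexWriter writer =
          new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
      writer.setUseCompoundFile(true); // existing setting: write segments as .cfs
      writer.setMergeFactor(10);
      writer.setMaxBufferedDocs(100);
      // Proposed setting (name hypothetical): segments that reach
      // 1,000,000 documents are written in multi-file format instead.
      writer.setCompoundFileSegmentSizeLimit(1000000);
      // ... add documents ...
      writer.close();
    }
  }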

Due to the exponential merges, a Lucene index usually contains only a few very 
big segments but many more small segments. Top performance is really only 
needed for the big segments, whereas slightly worse performance for the small 
segments should not play a big role in overall search performance.
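
Internally the change boils down to one additional check when the IndexWriter 
decides which format to use for a newly written segment. A simplified sketch 
of the idea (not the actual patch; the field and method names are 
illustrative):

  // Inside IndexWriter: use the compound format only for segments below
  // the configured limit. A limit of 0 is assumed to mean "no limit",
  // i.e. the current behavior of always using compound files.
  private boolean useCompoundFileForSegment(int segmentDocCount) {
    return getUseCompoundFile()
        && (compoundFileSegmentSizeLimit == 0
            || segmentDocCount < compoundFileSegmentSizeLimit);
  }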


Consider the following example:
Index size:                1,500,000 documents
Merge factor:                     10
Max buffered docs:               100
Number of indexed fields:         10
Max. OS file descriptors:      1,024

In the worst case, an unoptimized index could contain the following segments:
1 x 1,000,000 docs
9 x   100,000 docs
9 x    10,000 docs
9 x     1,000 docs
9 x       100 docs

That's 37 segments. A multi-file format index would have:
37 segments * (7 files per segment + 10 files for indexed fields) = 629 files
==> only about 2 open indexes per machine could be handled by the operating 
system (1,024 / 629 ≈ 1.6)

A compound-file format index would have:
37 segments * 1 cfs file = 37 files
==> about 27 open indexes could be handled by the operating system 
(1,024 / 37 ≈ 27.7), but performance would be 5-10% worse.

A compound-file format index with CompoundFileSegmentSizeLimit = 1,000,000 
would have:
36 segments * 1 cfs file + 1 segment * (7 + 10 files) = 53 files
==> about 20 open indexes could be handled by the OS (1,024 / 53 ≈ 19.3)
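
The numbers above are simple arithmetic; here is a small self-contained 
snippet that reproduces them (the 7 per-segment files and the 10 files for 
indexed fields are taken from the example setup):

  public class FileCountExample {
    // Total index files when 'multiFile' of the 'segments' segments are
    // in multi-file format and the rest are single .cfs files each.
    static int files(int segments, int multiFile) {
      return (segments - multiFile) + multiFile * (7 + 10);
    }

    public static void main(String[] args) {
      double fds = 1024; // max. OS file descriptors
      System.out.println("multi-file:   " + files(37, 37) + " files, "
          + String.format("%.1f", fds / files(37, 37)) + " indexes"); // 629, 1.6
      System.out.println("all compound: " + files(37, 0) + " files, "
          + String.format("%.1f", fds / files(37, 0)) + " indexes");  // 37, 27.7
      System.out.println("with limit:   " + files(37, 1) + " files, "
          + String.format("%.1f", fds / files(37, 1)) + " indexes");  // 53, 19.3
    }
  }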


The OS can now handle 20 instead of just 2 open indexes, while search 
performance essentially stays at the multi-file level, because the largest 
segment remains in multi-file format.

I'm going to create diffs against the current HEAD and will attach the patch 
files soon. Please let me know what you think about this improvement.
