[ http://issues.apache.org/jira/browse/LUCENE-624?page=all ]
Michael Busch updated LUCENE-624:
---------------------------------
Attachment: cfs_seg_size_limit.patch
I have attached the patch file for this improvement.
This patch adds two new methods to the API of IndexWriter and IndexModifier:
/** Get the current value of the compound file segment size limit.
* Note that this just returns the value you set with
* setCompoundFileSegmentSizeLimit(int) or the default. You cannot use
* this to query the status of an existing index.
* @see #setCompoundFileSegmentSizeLimit(int)
*/
public int getCompoundFileSegmentSizeLimit();
/** Sets the maximum number of documents a segment may contain
* and still be written in compound format. A high limit
* decreases the number of files per index, whereas a lower
* limit improves search performance but increases the number
* of files.
*/
public void setCompoundFileSegmentSizeLimit(int value);
Furthermore, I added a constant to IndexWriter:
public final static int DEFAULT_COMPOUND_FILE_SEGMENT_SIZE_LIMIT = Integer.MAX_VALUE;
Since the default value is set to Integer.MAX_VALUE, the behavior of
IndexWriter/IndexModifier only changes if the user uses
setCompoundFileSegmentSizeLimit(int) to change the value explicitly.
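
To illustrate the intended semantics, here is a minimal self-contained sketch. It is not code from the patch and does not touch Lucene: the class CfsLimitSketch and the method usesCompoundFormat are hypothetical names, and the comparison rule (compound format as long as the document count does not exceed the limit) is an assumption based on the description above.

```java
/** Hypothetical stand-in for the per-segment decision; not Lucene code. */
public class CfsLimitSketch {
    // Mirrors the default named in the patch description.
    public static final int DEFAULT_COMPOUND_FILE_SEGMENT_SIZE_LIMIT = Integer.MAX_VALUE;

    private int limit = DEFAULT_COMPOUND_FILE_SEGMENT_SIZE_LIMIT;

    /** Returns the value set by the caller, or the default. */
    public int getCompoundFileSegmentSizeLimit() {
        return limit;
    }

    public void setCompoundFileSegmentSizeLimit(int value) {
        limit = value;
    }

    /** Assumed rule: compound format only while docCount does not exceed the limit. */
    public boolean usesCompoundFormat(int docCount) {
        return docCount <= limit;
    }

    public static void main(String[] args) {
        CfsLimitSketch writer = new CfsLimitSketch();
        // With the default limit every segment stays in compound format.
        System.out.println(writer.usesCompoundFormat(1_000_000));

        // After lowering the limit, only small segments use compound format.
        writer.setCompoundFileSegmentSizeLimit(100_000);
        System.out.println(writer.usesCompoundFormat(1_000_000));
        System.out.println(writer.usesCompoundFormat(100_000));
    }
}
```

With the default of Integer.MAX_VALUE no int document count can exceed the limit, which is why existing behavior stays unchanged until the setter is called.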
> Segment size limit for compound files
> -------------------------------------
>
> Key: LUCENE-624
> URL: http://issues.apache.org/jira/browse/LUCENE-624
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Reporter: Michael Busch
> Priority: Minor
> Attachments: cfs_seg_size_limit.patch
>
>
> Hello everyone,
> I implemented an improvement targeting compound file usage. Compound files
> are used to decrease the number of index files, because operating systems
> limit the number of open file descriptors. On the other hand, a
> disadvantage of the compound file format is that it performs worse than
> multi-file indexes:
> http://www.gossamer-threads.com/lists/lucene/java-user/8950
> In the book "Lucene in Action" it's said that compound file format is about
> 5-10% slower than multi-file format.
> The patch I'm proposing here adds the ability to the IndexWriter to use
> compound format only for segments that do not contain more documents than a
> specific limit "CompoundFileSegmentSizeLimit", which the user can set.
> Due to the exponential merges, a Lucene index usually contains only a few
> very big segments, but many more small segments. The best performance is
> actually only needed for the big segments, whereas slightly worse
> performance for small segments shouldn't play a big role in the overall
> search performance.
> Consider the following example:
> Index Size: 1,500,000
> Merge factor: 10
> Max buffered docs: 100
> Number of indexed fields: 10
> Max. OS file descriptors: 1024
> In the worst case, an unoptimized index could contain the following
> segments:
> 1 x 1,000,000
> 9 x 100,000
> 9 x 10,000
> 9 x 1,000
> 9 x 100
> That's 37 segments. A multi-file format index would have:
> 37 segments * (7 files per segment + 10 files for indexed fields) = 629 files
> ==> only about 2 open indexes per machine could be handled by the operating
> system
> A compound-file format index would have:
> 37 segments * 1 cfs file = 37 files ==> about 27 open indexes could be
> handled by the operating system, but performance would be 5-10% worse.
> A compound-file format index with CompoundFileSegmentSizeLimit = 1,000,000
> would have:
> 36 segments * 1 cfs file + 1 segment * (7 + 10 files) = 53 ==> about 20 open
> indexes could be handled by the OS
> The OS can now handle about 20 instead of just 2 open indexes, while keeping
> multi-file format performance for the largest segment.
> I'm going to create diffs on the current HEAD and will attach the patch files
> soon. Please let me know what you think about this improvement.
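
The file counts in the quoted example can be reproduced with a short sketch. This is only an illustration of the arithmetic, not code from the patch: the class CfsFileCount and its methods are hypothetical, and it assumes the rule stated above (compound format for segments with at most "limit" documents). Under that strict reading, the limit of 100,000 used below is one value that leaves only the 1,000,000-document segment in multi-file format.

```java
/** Hypothetical helper that reproduces the file-count arithmetic; not Lucene code. */
public class CfsFileCount {
    // From the example: 7 files per segment plus one file per indexed field (10).
    static final int FILES_PER_MULTI_FILE_SEGMENT = 7 + 10;
    static final int MAX_FILE_DESCRIPTORS = 1024;

    /** Worst-case segment sizes from the example: 1 x 1,000,000 and 9 each of 100,000 down to 100. */
    static int[] worstCaseSegments() {
        int[] segments = new int[37];
        int i = 0;
        segments[i++] = 1_000_000;
        for (int size = 100_000; size >= 100; size /= 10) {
            for (int k = 0; k < 9; k++) {
                segments[i++] = size;
            }
        }
        return segments;
    }

    /** Total files when segments with at most 'limit' documents use a single .cfs file. */
    static int totalFiles(int[] segmentDocCounts, int limit) {
        int files = 0;
        for (int docs : segmentDocCounts) {
            files += (docs <= limit) ? 1 : FILES_PER_MULTI_FILE_SEGMENT;
        }
        return files;
    }

    public static void main(String[] args) {
        int[] segments = worstCaseSegments();
        int multiFile = totalFiles(segments, 0);                 // no segment qualifies for compound format
        int compound = totalFiles(segments, Integer.MAX_VALUE);  // every segment uses compound format
        int mixed = totalFiles(segments, 100_000);               // only the largest segment is multi-file

        for (int files : new int[] {multiFile, compound, mixed}) {
            // Integer division: number of indexes that fit entirely within the descriptor limit.
            System.out.println(files + " files, " + (MAX_FILE_DESCRIPTORS / files) + " whole indexes");
        }
    }
}
```

The totals come out as 629, 37 and 53 files, matching the example. Integer division counts only indexes that fit completely (1, 27 and 19), which is slightly lower than the rounded figures of about 2 and about 20 quoted above.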