In my experience, the more segment files the worse the performance
(thus the optimize method).
On Jul 27, 2006, at 7:44 PM, Michael Busch wrote:
robert engels wrote:
Why would more segment files improve search performance? I can see
that if you have many smaller files, the merge process for
incremental adds might be faster, but more segments should
actually make searching slower.
Robert,
I did not run my own performance experiments, but after reading
some threads about compound file performance again, I think you are
right. Compound file format does not affect search performance
significantly, but it slows down indexing by 5-10%. So this
tiny patch should improve indexing speed while keeping the number
of segment files relatively low. If I find some time I will run
performance experiments to get some numbers.
Michael
On Jul 26, 2006, at 5:18 PM, Michael Busch (JIRA) wrote:
[ http://issues.apache.org/jira/browse/LUCENE-624?page=all ]
Michael Busch updated LUCENE-624:
---------------------------------
Attachment: cfs_seg_size_limit.patch
I attach the patch file for this improvement.
This patch adds two new methods to the API of IndexWriter and
IndexModifier:
/** Get the current value of the compound file segment size limit.
 *  Note that this just returns the value you set with
 *  setCompoundFileSegmentSizeLimit(int) or the default. You cannot
 *  use this to query the status of an existing index.
 *  @see #setCompoundFileSegmentSizeLimit(int)
 */
public int getCompoundFileSegmentSizeLimit();
/** Sets the maximum number of documents a segment can have for
 *  compound format to be used for that segment. A high limit
 *  will decrease the number of files per index, whereas a lower
 *  limit will improve search performance but increase the number
 *  of files.
 */
public void setCompoundFileSegmentSizeLimit(int value);
Furthermore I added a constant to IndexWriter:
public final static int DEFAULT_COMPOUND_FILE_SEGMENT_SIZE_LIMIT
    = Integer.MAX_VALUE;
Since the default value is set to Integer.MAX_VALUE, the behavior
of IndexWriter/IndexModifier only changes if the user uses
setCompoundFileSegmentSizeLimit(int) to change the value explicitly.
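
To make the intended behavior concrete, here is a minimal sketch of the decision the two methods control. It is not the patch itself and not Lucene code: the class CompoundFileLimitSketch and the method usesCompoundFormat are hypothetical stand-ins, and the "document count <= limit" rule is an assumption based on the issue description ("segments that do not contain more documents than a specific limit").

```java
// Hypothetical stand-in for the setting described above; not Lucene's IndexWriter.
class CompoundFileLimitSketch {
    // Same default as proposed in the patch: effectively no limit.
    static final int DEFAULT_COMPOUND_FILE_SEGMENT_SIZE_LIMIT = Integer.MAX_VALUE;

    private int compoundFileSegmentSizeLimit = DEFAULT_COMPOUND_FILE_SEGMENT_SIZE_LIMIT;

    int getCompoundFileSegmentSizeLimit() {
        return compoundFileSegmentSizeLimit;
    }

    void setCompoundFileSegmentSizeLimit(int value) {
        compoundFileSegmentSizeLimit = value;
    }

    // Assumed rule: a segment is written in compound format only while
    // its document count does not exceed the limit.
    boolean usesCompoundFormat(int segmentDocCount) {
        return segmentDocCount <= compoundFileSegmentSizeLimit;
    }

    public static void main(String[] args) {
        CompoundFileLimitSketch writer = new CompoundFileLimitSketch();
        // With the default limit, every segment stays in compound format.
        System.out.println(writer.usesCompoundFormat(1_000_000)); // true

        // With an explicit limit, only the small segments use compound format.
        writer.setCompoundFileSegmentSizeLimit(100_000);
        System.out.println(writer.usesCompoundFormat(1_000_000)); // false
        System.out.println(writer.usesCompoundFormat(100));       // true
    }
}
```

With the default of Integer.MAX_VALUE the check always passes, which matches the statement that behavior only changes once the limit is set explicitly.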
Segment size limit for compound files
-------------------------------------
Key: LUCENE-624
URL: http://issues.apache.org/jira/browse/LUCENE-624
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Reporter: Michael Busch
Priority: Minor
Attachments: cfs_seg_size_limit.patch
Hello everyone,
I implemented an improvement targeting compound file usage.
Compound files are used to decrease the number of index files,
because operating systems can't handle too many open file
descriptors. On the other hand, a disadvantage of compound file
format is the worse performance compared to multi-file indexes:
http://www.gossamer-threads.com/lists/lucene/java-user/8950
In the book "Lucene in Action" it's said that compound file
format is about 5-10% slower than multi-file format.
The patch I'm proposing here lets the IndexWriter use compound
format only for segments that do not contain more documents than
a limit, "CompoundFileSegmentSizeLimit", which the user can set.
Due to the exponential merges, a Lucene index usually contains
only a few very big segments, but many more small segments. The
best performance is actually just needed for the big segments,
whereas a slightly worse performance for small segments shouldn't
play a big role in the overall search performance.
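
The segment distribution that exponential merging produces can be sketched with a small simulation. This is a simplified model, not Lucene's actual merge code: it assumes a segment is flushed every maxBufferedDocs documents and that whenever the last mergeFactor segments have the same size they are merged into one. The class MergeSketch, its method names, and the document count in main are all assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of exponential (logarithmic) segment merging.
class MergeSketch {
    // Returns the document counts of the on-disk segments, oldest first.
    // Documents still in the in-memory buffer are ignored.
    static List<Integer> simulate(int numDocs, int mergeFactor, int maxBufferedDocs) {
        List<Integer> segments = new ArrayList<>();
        int flushed = 0;
        while (numDocs - flushed >= maxBufferedDocs) {
            segments.add(maxBufferedDocs);   // flush a new small segment
            flushed += maxBufferedDocs;
            mergeTail(segments, mergeFactor);
        }
        return segments;
    }

    // Merges the newest mergeFactor segments while they all have equal size.
    static void mergeTail(List<Integer> segments, int mergeFactor) {
        while (segments.size() >= mergeFactor) {
            int n = segments.size();
            int size = segments.get(n - 1);
            for (int i = n - mergeFactor; i < n; i++) {
                if (segments.get(i) != size) {
                    return;                  // sizes differ, nothing to merge
                }
            }
            for (int i = 0; i < mergeFactor; i++) {
                segments.remove(segments.size() - 1);
            }
            segments.add(size * mergeFactor); // one segment, mergeFactor times larger
        }
    }

    public static void main(String[] args) {
        // Assumed inputs: 1,999,900 docs, merge factor 10, 100 buffered docs.
        List<Integer> segments = simulate(1_999_900, 10, 100);
        System.out.println(segments.size() + " segments: " + segments);
    }
}
```

Under this model the index behaves like a counter in base mergeFactor: each size level holds at most mergeFactor - 1 segments, so most segments are small and only a handful are large.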
Consider the following example:
Index Size: 1,500,000
Merge factor: 10
Max buffered docs: 100
Number of indexed fields: 10
Max. OS file descriptors: 1024
In the worst case, a non-optimized index could contain the
following number of segments:
1 x 1,000,000
9 x 100,000
9 x 10,000
9 x 1,000
9 x 100
That's 37 segments. A multi-file format index would have:
37 segments * (7 files per segment + 10 files for indexed
fields) = 629 files ==> only about 2 open indexes per machine
could be handled by the operating system
A compound-file format index would have:
37 segments * 1 cfs file = 37 files ==> about 27 open indexes
could be handled by the operating system, but performance would
be 5-10% worse.
A compound-file format index with CompoundFileSegmentSizeLimit =
1,000,000 would have:
36 segments * 1 cfs file + 1 segment * (7 + 10 files) = 53 ==>
about 20 open indexes could be handled by the OS
The OS can now handle 20 instead of just 2 open indexes, while
keeping multi-file format performance for the largest segment.
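
The arithmetic in this example can be checked with a short sketch. The numbers (37 segments, 7 + 10 files per multi-file segment, 1024 file descriptors) come from the example above; the class FileCountSketch and its method names are hypothetical. Note that plain integer division gives 1, 27, and 19 open indexes, slightly below the rounded "about 2" and "about 20" quoted in the text.

```java
// Recomputes the file counts from the example; names are illustrative only.
class FileCountSketch {
    static final int SEGMENTS = 37;               // 1 + 9 + 9 + 9 + 9
    static final int FILES_PER_SEGMENT = 7 + 10;  // 7 files + 10 indexed fields
    static final int MAX_FILE_DESCRIPTORS = 1024;

    // Every segment in multi-file format.
    static int multiFileCount() {
        return SEGMENTS * FILES_PER_SEGMENT;
    }

    // Every segment as a single .cfs file.
    static int compoundCount() {
        return SEGMENTS;
    }

    // largeSegments in multi-file format, all others as one .cfs file each.
    static int mixedCount(int largeSegments) {
        return (SEGMENTS - largeSegments) + largeSegments * FILES_PER_SEGMENT;
    }

    // Number of complete indexes that fit within the descriptor limit.
    static int openIndexes(int filesPerIndex) {
        return MAX_FILE_DESCRIPTORS / filesPerIndex;
    }

    public static void main(String[] args) {
        int multi = multiFileCount();   // 629
        int cfs = compoundCount();      // 37
        int mixed = mixedCount(1);      // 53
        System.out.println("multi-file: " + multi + " files, " + openIndexes(multi) + " open indexes");
        System.out.println("compound:   " + cfs + " files, " + openIndexes(cfs) + " open indexes");
        System.out.println("mixed:      " + mixed + " files, " + openIndexes(mixed) + " open indexes");
    }
}
```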
I'm going to create diffs on the current HEAD and will attach
the patch files soon. Please let me know what you think about
this improvement.
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the
administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]