Probably not during indexing, which is what Michael was referring to in his last email, if I understood him correctly. I suppose indexing with compound format would be a bit slower because individual index files will have to be compounded in a .cfs file, and that'll consume a bit of extra time.
Otis

----- Original Message ----
From: robert engels <[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org
Sent: Thursday, July 27, 2006 8:48:53 PM
Subject: Re: [jira] Updated: (LUCENE-624) Segment size limit for compound files

In my experience, the more segment files the worse the performance (thus the optimize method).

On Jul 27, 2006, at 7:44 PM, Michael Busch wrote:

> robert engels wrote:
>> Why do more segment files improve search performance? I can see
>> that if you have many smaller files, the merge process for
>> incremental adds might be faster, but more segments should
>> actually make searching slower.
> Robert,
>
> I did not run my own performance experiments, but after reading
> some threads about compound performance again I think you are
> right. Compound file format does not affect search performance
> significantly, but it slows down indexing time by 5-10%. So this
> tiny patch should improve indexing speed while keeping the number
> of segment files relatively low. If I find some time I will run
> performance experiments to get some numbers.
>
> Michael
>
>> On Jul 26, 2006, at 5:18 PM, Michael Busch (JIRA) wrote:
>>
>>> [ http://issues.apache.org/jira/browse/LUCENE-624?page=all ]
>>>
>>> Michael Busch updated LUCENE-624:
>>> ---------------------------------
>>>
>>> Attachment: cfs_seg_size_limit.patch
>>>
>>> I attach the patch file for this improvement.
>>>
>>> This patch adds two new methods to the API of IndexWriter and
>>> IndexModifier:
>>> /** Get the current value of the compound file segment size limit.
>>>  * Note that this just returns the value you set with
>>>  * setCompoundFileSegmentSizeLimit(int) or the default. You cannot
>>>  * use this to query the status of an existing index.
>>>  * @see #setCompoundFileSegmentSizeLimit(int)
>>>  */
>>> public int getCompoundFileSegmentSizeLimit();
>>>
>>> /** Sets the limit of documents a segment can have so that
>>>  * compound format is used for that segment. A high
>>>  * limit will decrease the number of files per index, whereas
>>>  * a lower limit will improve search performance but
>>>  * increase the number of files.
>>>  */
>>> public void setCompoundFileSegmentSizeLimit(int value);
>>>
>>> Furthermore I added a constant to IndexWriter:
>>> public final static int DEFAULT_COMPOUND_FILE_SEGMENT_SIZE_LIMIT
>>> = Integer.MAX_VALUE;
>>>
>>> Since the default value is set to Integer.MAX_VALUE, the behavior
>>> of IndexWriter/IndexModifier only changes if the user uses
>>> setCompoundFileSegmentSizeLimit(int) to change the value explicitly.
>>>
>>>> Segment size limit for compound files
>>>> -------------------------------------
>>>>
>>>> Key: LUCENE-624
>>>> URL: http://issues.apache.org/jira/browse/LUCENE-624
>>>> Project: Lucene - Java
>>>> Issue Type: Improvement
>>>> Components: Index
>>>> Reporter: Michael Busch
>>>> Priority: Minor
>>>> Attachments: cfs_seg_size_limit.patch
>>>>
>>>>
>>>> Hello everyone,
>>>> I implemented an improvement targeting compound file usage.
>>>> Compound files are used to decrease the number of index files,
>>>> because operating systems can't handle too many open file
>>>> descriptors. On the other hand, a disadvantage of compound file
>>>> format is the worse performance compared to multi-file indexes:
>>>> http://www.gossamer-threads.com/lists/lucene/java-user/8950
>>>> In the book "Lucene in Action" it's said that compound file
>>>> format is about 5-10% slower than multi-file format.
>>>> The patch I'm proposing here adds the ability to the IndexWriter
>>>> to use compound format only for segments that do not contain
>>>> more documents than a specific limit
>>>> "CompoundFileSegmentSizeLimit", which the user can set.
>>>> Due to the exponential merges, a Lucene index usually contains
>>>> only a few very big segments, but many more small segments. The
>>>> best performance is actually just needed for the big segments,
>>>> whereas a slightly worse performance for small segments shouldn't
>>>> play a big role in the overall search performance.
>>>> Consider the following example:
>>>> Index Size: 1,500,000
>>>> Merge factor: 10
>>>> Max buffered docs: 100
>>>> Number of indexed fields: 10
>>>> Max. OS file descriptors: 1024
>>>> In the worst case a not-optimized index could contain the
>>>> following number of segments:
>>>> 1 x 1,000,000
>>>> 9 x 100,000
>>>> 9 x 10,000
>>>> 9 x 1,000
>>>> 9 x 100
>>>> That's 37 segments. A multi-file format index would have:
>>>> 37 segments * (7 files per segment + 10 files for indexed
>>>> fields) = 629 files ==> only about 2 open indexes per machine
>>>> could be handled by the operating system
>>>> A compound-file format index would have:
>>>> 37 segments * 1 cfs file = 37 files ==> about 27 open indexes
>>>> could be handled by the operating system, but performance would
>>>> be 5-10% worse.
>>>> A compound-file format index with CompoundFileSegmentSizeLimit =
>>>> 1,000,000 would have:
>>>> 36 segments * 1 cfs file + 1 segment * (7 + 10 files) = 53 ==>
>>>> about 20 open indexes could be handled by the OS
>>>> The OS can now handle 20 instead of just 2 open indexes, while
>>>> maintaining the multi-file format performance.
>>>> I'm going to create diffs on the current HEAD and will attach
>>>> the patch files soon. Please let me know what you think about
>>>> this improvement.
>>>
>>> --This message is automatically generated by JIRA.
>>> -
>>> If you think it was sent incorrectly contact one of the administrators:
>>> http://issues.apache.org/jira/secure/Administrators.jspa
>>> -
>>> For more information on JIRA, see: http://www.atlassian.com/software/jira
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>>> For additional commands, e-mail: [EMAIL PROTECTED]
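
The file-count arithmetic in the LUCENE-624 description quoted above can be reproduced with a small standalone sketch. Nothing here is Lucene API: the class CfsFileCount and all of its methods are hypothetical names introduced only for illustration. One assumption is made to match the worked example: a segment whose document count is at least the limit stays in multi-file format, which is how the example treats the 1,000,000-document segment.

```java
// Hypothetical helper, not part of Lucene: recomputes the file counts
// from the worked example in the LUCENE-624 description.
final class CfsFileCount {
    static final int FIXED_FILES = 7;        // non-field files per multi-file segment (from the example)
    static final int INDEXED_FIELDS = 10;    // one extra file per indexed field (from the example)
    static final int MAX_DESCRIPTORS = 1024; // OS file descriptor limit (from the example)

    /** Worst-case segment sizes from the example: 1 x 1,000,000 plus 9 each of 100,000 down to 100. */
    static int[] worstCaseSegments() {
        int[] sizes = new int[37];
        int i = 0;
        sizes[i++] = 1_000_000;
        for (int size = 100_000; size >= 100; size /= 10) {
            for (int n = 0; n < 9; n++) {
                sizes[i++] = size;
            }
        }
        return sizes;
    }

    /** Files in a pure multi-file index. */
    static int multiFile(int segments, int fixedFiles, int indexedFields) {
        return segments * (fixedFiles + indexedFields);
    }

    /** Files when only segments below the limit use a single .cfs file. */
    static int mixed(int[] segmentSizes, int limit, int fixedFiles, int indexedFields) {
        int files = 0;
        for (int size : segmentSizes) {
            // Assumption: size >= limit means multi-file, as in the worked example.
            files += (size >= limit) ? fixedFiles + indexedFields : 1;
        }
        return files;
    }

    public static void main(String[] args) {
        int[] sizes = worstCaseSegments();
        int multi = multiFile(sizes.length, FIXED_FILES, INDEXED_FIELDS);
        int compound = sizes.length; // one .cfs file per segment
        int limited = mixed(sizes, 1_000_000, FIXED_FILES, INDEXED_FIELDS);
        System.out.printf("multi-file: %d files, %.1f indexes per descriptor limit%n",
                multi, (double) MAX_DESCRIPTORS / multi);
        System.out.printf("compound:   %d files, %.1f indexes per descriptor limit%n",
                compound, (double) MAX_DESCRIPTORS / compound);
        System.out.printf("with limit: %d files, %.1f indexes per descriptor limit%n",
                limited, (double) MAX_DESCRIPTORS / limited);
    }
}
```

The printed ratios are unrounded quotients, so they can differ slightly from the "about N open indexes" figures in the description, which are approximations.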