[ http://issues.apache.org/jira/browse/LUCENE-624?page=all ]

Michael Busch closed LUCENE-624.
--------------------------------

    Resolution: Won't Fix
      Assignee: Michael Busch

I'm closing this issue, because:
- there have been no votes or comments for almost half a year
- only indexing performance benefits from this feature, and only slightly
- yet another config parameter in IndexWriter would probably confuse users 
more than help them

> Segment size limit for compound files
> -------------------------------------
>
>                 Key: LUCENE-624
>                 URL: http://issues.apache.org/jira/browse/LUCENE-624
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>         Assigned To: Michael Busch
>            Priority: Minor
>         Attachments: cfs_seg_size_limit.patch
>
>
> Hello everyone,
> I implemented an improvement targeting compound file usage. Compound files 
> are used to decrease the number of index files, because operating systems 
> limit how many file descriptors can be open at once. On the other hand, a 
> disadvantage of the compound file format is its worse performance compared 
> to multi-file indexes:
> http://www.gossamer-threads.com/lists/lucene/java-user/8950
> The book "Lucene in Action" states that the compound file format is about 
> 5-10% slower than the multi-file format.
> The patch I'm proposing here adds the ability for IndexWriter to use the 
> compound format only for segments that do not contain more documents than a 
> user-settable limit, "CompoundFileSegmentSizeLimit".
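>
> For illustration, here is a minimal usage sketch. The setter name below is 
> hypothetical, inferred from the parameter name; the attached patch may use 
> a different signature:
>
>     import java.io.IOException;
>     import org.apache.lucene.analysis.standard.StandardAnalyzer;
>     import org.apache.lucene.index.IndexWriter;
>     import org.apache.lucene.store.FSDirectory;
>
>     public class CfsLimitExample {
>         public static void main(String[] args) throws IOException {
>             // Open a writer that writes compound files by default.
>             IndexWriter writer = new IndexWriter(
>                 FSDirectory.getDirectory("/path/to/index", true),
>                 new StandardAnalyzer(), true);
>             writer.setUseCompoundFile(true);
>
>             // Hypothetical setter added by this patch: segments with more
>             // than 1,000,000 documents are written in multi-file format.
>             writer.setCompoundFileSegmentSizeLimit(1000000);
>
>             writer.close();
>         }
>     }
>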
> Due to the exponential merge policy, a Lucene index usually contains only a 
> few very big segments but many more small segments. The best performance is 
> really only needed for the big segments, whereas slightly worse performance 
> for the small segments shouldn't play a big role in overall search 
> performance.
> Consider the following example:
> Index size:                  1,500,000 documents
> Merge factor:                10
> Max buffered docs:           100
> Number of indexed fields:    10
> Max OS file descriptors:     1024
> In the worst case, an unoptimized index could contain the following 
> segments (count x documents per segment):
> 1 x 1,000,000
> 9 x   100,000
> 9 x    10,000
> 9 x     1,000
> 9 x       100
> That's 37 segments. A multi-file format index would have:
> 37 segments * (7 files per segment + 10 files for indexed fields) = 629 files 
> ==> only about 2 open indexes per machine could be handled by the operating 
> system
> A compound-file format index would have:
> 37 segments * 1 cfs file = 37 files ==> about 27 open indexes could be 
> handled by the operating system, but performance would be 5-10% worse.
> A compound-file format index with CompoundFileSegmentSizeLimit = 1,000,000 
> would have:
> 36 segments * 1 cfs file + 1 segment * (7 + 10 files) = 53 files ==> about 
> 20 open indexes could be handled by the OS
> The OS can now handle about 20 open indexes instead of just 2, while the 
> single big segment keeps full multi-file performance.
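>
> As a quick back-of-the-envelope check of the numbers above, here is a small 
> piece of plain Java; the constants simply restate this example and are not 
> part of the patch:
>
>     public class FdBudget {
>         public static void main(String[] args) {
>             int maxFds = 1024;                         // OS fd limit
>             int filesPerMultiSeg = 7 + 10;             // base + field files
>             int allMultiFile = 37 * filesPerMultiSeg;  // 629 files
>             int allCompound  = 37;                     // 37 files
>             int mixed = 36 + filesPerMultiSeg;         // 53 files
>             System.out.println(maxFds / allMultiFile); // 1  ("about 2")
>             System.out.println(maxFds / allCompound);  // 27
>             System.out.println(maxFds / mixed);        // 19 ("about 20")
>         }
>     }
>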
> I'm going to create diffs against the current HEAD and will attach the 
> patch files soon. Please let me know what you think about this improvement.
