[jira] Commented: (LUCENE-1634) LogMergePolicy should use the number of deleted docs when deciding which segments to merge

Michael McCandless (JIRA) Wed, 13 May 2009 10:24:15 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709015#action_12709015
 ]


Michael McCandless commented on LUCENE-1634:
--------------------------------------------

bq. So to implement your own MergePolicy, you have to resort back to sneaking 
the class into the package.

Right, this is currently necessary for a custom MergePolicy/Scheduler.  It's 
been discussed before:

  
http://www.nabble.com/MergePolicy-public-but-SegmentInfos-package-protected--tt22687527.html

I suppose since merge selection needs so little info about a segment, we could 
make a public thin wrapper/veneer that exposes a limited number of things.  Or 
maybe we go whole hog and simply make SegmentInfos/SegmentInfo public.
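
To make the thin-wrapper idea concrete, here is a minimal sketch of what such a 
read-only veneer might look like.  All names below (SegmentView, 
FixedSegmentView, SegmentViewDemo) are hypothetical and do not exist in Lucene; 
the choice of exposed fields is only an assumption about what merge selection 
typically needs.

```java
// Hypothetical read-only view of a segment. These names are illustrative
// only and are NOT part of Lucene's API.
interface SegmentView {
    String name();
    int docCount();     // total docs in the segment, including deleted ones
    int delCount();     // docs currently marked as deleted
    long sizeInBytes(); // on-disk size of the segment
}

// Trivial immutable implementation that the index owner could hand to a
// custom merge policy instead of its package-private segment object.
final class FixedSegmentView implements SegmentView {
    private final String name;
    private final int docCount;
    private final int delCount;
    private final long sizeInBytes;

    FixedSegmentView(String name, int docCount, int delCount, long sizeInBytes) {
        this.name = name;
        this.docCount = docCount;
        this.delCount = delCount;
        this.sizeInBytes = sizeInBytes;
    }

    public String name() { return name; }
    public int docCount() { return docCount; }
    public int delCount() { return delCount; }
    public long sizeInBytes() { return sizeInBytes; }
}

final class SegmentViewDemo {
    // A policy written against the view needs no package-private access.
    static int liveDocs(SegmentView s) {
        return s.docCount() - s.delCount();
    }

    public static void main(String[] args) {
        SegmentView seg = new FixedSegmentView("_a", 1000, 900, 5_000_000L);
        System.out.println(seg.name() + " live docs: " + liveDocs(seg));
    }
}
```

A custom policy coded against an interface like this could live in any package, 
which is the point of the veneer; making SegmentInfos/SegmentInfo public would 
achieve the same thing at the cost of a much larger public surface.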

bq.  it is hidden because these methods are package protected

If we could javadoc certain package protected classes, that'd give you the 
javadocs at least.  Or should we simply make these methods public?

bq. Not only set/getUseCompoundFile is no longer applicable if LogMergePolicy 
is not used, also popular methods such as set/getMergeFactor etc. are only 
applicable to LogMergePolicy.

But the notion of a mergeFactor is very much a LogMergePolicy specific thing.  
Other merge policies might not limit themselves to always merging mergeFactor 
segments at once.  These are convenience methods on IndexWriter (that simply 
forward the request to the MergePolicy).
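
The forwarding pattern can be sketched roughly as follows.  This is a 
simplified model using stand-in classes (PolicySketch, LogPolicySketch, 
WriterSketch), not the real IndexWriter or LogMergePolicy code; the exception 
thrown for a non-log policy and the default value of 10 are assumptions made 
for the illustration.

```java
// Simplified stand-ins for illustration; NOT the real Lucene classes.
interface PolicySketch {
    // A general merge policy has no notion of a merge factor.
}

final class LogPolicySketch implements PolicySketch {
    private int mergeFactor = 10; // assumed default for this sketch

    int getMergeFactor() {
        return mergeFactor;
    }

    void setMergeFactor(int mergeFactor) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("mergeFactor cannot be less than 2");
        }
        this.mergeFactor = mergeFactor;
    }
}

final class WriterSketch {
    private final PolicySketch policy;

    WriterSketch(PolicySketch policy) {
        this.policy = policy;
    }

    // Casts to the log-based policy, or fails if another policy is in use.
    private LogPolicySketch logPolicy() {
        if (policy instanceof LogPolicySketch) {
            return (LogPolicySketch) policy;
        }
        throw new IllegalArgumentException(
            "mergeFactor only applies when a log-based merge policy is in use");
    }

    // Convenience methods: they hold no state and only forward the request.
    void setMergeFactor(int mergeFactor) {
        logPolicy().setMergeFactor(mergeFactor);
    }

    int getMergeFactor() {
        return logPolicy().getMergeFactor();
    }
}
```

With a log-based policy the calls simply pass through; with any other policy 
the convenience methods have nothing sensible to forward to, which is why they 
cannot be made general.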

bq. my guess is that set/getCompoundFile should be applicable to all 
implementations of MergePolicy

I think that'd make sense.  (I can't remember exactly why, but way back when, I 
think there was some reason for not doing so...)

> LogMergePolicy should use the number of deleted docs when deciding which 
> segments to merge
> ------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1634
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1634
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Yasuhiro Matsuda
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1634.patch
>
>
> I found that IndexWriter.optimize(int) method does not pick up large segments 
> with a lot of deletes even when most of the docs are deleted. And the 
> existence of such segments affected the query performance significantly.
> I created an index with 1 million docs, then went over all docs and updated a 
> few thousand at a time.  I ran optimize(20) occasionally. What I saw were large 
> segments with most of the docs deleted. Although these segments did not have 
> valid docs they remained in the directory for a very long time until more 
> segments with comparable or bigger sizes were created.
> This is because LogMergePolicy.findMergeForOptimize uses the size of segments 
> but does not take the number of deleted documents into consideration when it 
> decides which segments to merge. So, a simple fix is to use the delete count 
> to calibrate the segment size. I can create a patch for this.
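
A minimal sketch of the calibration idea described in the issue, assuming the 
segment size is simply scaled by the fraction of live (non-deleted) docs.  The 
class and method names (DeleteAwareSize, calibratedSize) are hypothetical, and 
this is not the attached patch or the actual LogMergePolicy code:

```java
// Hypothetical helper illustrating delete-aware segment sizing.
// Not taken from LUCENE-1634.patch.
final class DeleteAwareSize {

    // Scales the raw segment size by the fraction of docs still live, so a
    // large segment that is mostly deletes looks small to merge selection.
    static long calibratedSize(long sizeInBytes, int docCount, int delCount) {
        if (docCount <= 0) {
            return sizeInBytes; // empty segment: nothing to calibrate against
        }
        if (delCount < 0 || delCount > docCount) {
            throw new IllegalArgumentException(
                "delCount must be between 0 and docCount");
        }
        double liveRatio = (double) (docCount - delCount) / docCount;
        return Math.round(sizeInBytes * liveRatio);
    }

    public static void main(String[] args) {
        // Half of the docs deleted: the segment counts as half its size.
        System.out.println(calibratedSize(1000L, 100, 50));
    }
}
```

Under this scheme a segment whose docs are all deleted gets a calibrated size 
of zero, so it would no longer be passed over in favor of smaller segments.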

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


