[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

Steven Parkes (JIRA) Thu, 16 Aug 2007 16:21:56 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12520408
 ]


Steven Parkes commented on LUCENE-847:
--------------------------------------

        I don't think so: I think if someone changes the merge policy to
        something else, it's fine to require that they then do settings
        directly through that merge policy.

You're going to want to change the default merge policy, right?  So you're 
going to change the hard cast in IW to that policy? So it'll fail for anyone 
that wants to just getMergePolicy back to the old policy?

If that's the case, I'm going to keep those tests the way they are because when 
you do change the policy, I'm assuming you'll keep many of them, just add the 
manual setMergePolicy(), and they'll need to have those casts put back in?

Maybe we just put it in MergePolicy interface and let them throw (e.g., via 
MergePolicyBase) if called on an unsupported merge policy? That's moving from 
compile time checking to run time checking, but ... 

        This is inside addIndexes that we're talking about.

Ah. Right.

        I think we shouldn't allow any mergePolicy to
        leave the index inconsistent (failing to copy over segments from other
        directories).

That makes sense to me. CMP could enforce this, even in the case of concurrent 
merges.

        No, after another N (= mergeFactor) flushes, it would return a new
        suggested merge.

Okay. I think I'm following you here.

Here's what I understand: in your model, (1) each call to merge will only ever 
generate one merge thread (regardless of how many levels might be full) and (2) 
you can get concurrency out of this as long as you consider a level "merge 
worthy" as different from "full", i.e., blocking).

You did say  

> > But, we
> > could relax that such that if ever the lowest level has >
> > 2*mergeFactor pending segments to merge then we select the 2nd
> > set.

And I think you'd want to modify that to select the lowest sufficiently over 
subscribed level, not just the lowest level if it's oversubscribed?

Perhaps this is sufficient, but not necessary? I see it as simpler just to have 
the merge policy (abstractly) generate a set of non-conflicting merges and let 
someone else worry about scheduling them.

        But, providing just a single concurrent merge already gains us
        concurrency of merging with adding of docs.

I'm worried about when you start the leftmost merge, that, say, is going to 
take a day. With a steady influx of docs, it's not going to be long before you 
need another merge and if you have only one thread, you're going to block for 
the rest of the day. You've bought a little concurrency, but it's the almost 
day-long block I really want to avoid.

With a log-like policy, I think it's feasible to have logN threads. You might 
not want them all doing disk i/o at the same time: you'd want to prioritize 
threads on the small merges and/or suspend large merge threads.  The speed with 
which the larger merge threads can vary when other merges are taking place, you 
just have to not stop them and start over. 

        Right, the LUCENE-845 merge policy doesn't look @ the return result of
        "merge".  It just looks at the newly created SegmentInfos.

Yeah. My thinking was this would be tweaked. If merger.merge returns a valid 
number of docs, it could recurse as it does. If merger.merge returned -1 (which 
CMP does), it would not recurse but simply continue the loop.

        Hmmmm, in fact, I think your CMP wrapper would not work with the merge
        policy in LUCENE-845, right?  Ie, won't it will just recurse forever?
        So actually I don't see how your CMP (using the current API) can in
        general safely "wrap" around a merge policy w/o breaking things?

I think it's safe, just not concurrent. The recursion would generate the same 
set of segments to merge and CMP would make the second call block (abstractly, 
anyway: it actually throws an exception that unwinds the stack and causes the 
call to start again from the top when the conflicting merge finishes).

        But, if you lock on IndexWriter, what about apps that use multiple
        threads to add documents and but don't use CMP?  When one thread gets
        tied up merging, you'll then block on the other synchronized methods?
        And you also can't flush from other threads either?  I think flushing
        a new segment should be allowed to run concurrently with the merge?

I'm not sure I'm following this. That's what happens now, right? Are you trying 
to get more concurrency then there is now w/o using CMP? I certainly haven't 
been trying to do that.

        I guess I don't
        see the reason to synchronize on IndexWriter instead of segmentInfos.

I looked at trying to make IW work when a synchronization of IW didn't imply a 
synchronization of segmentInfos. It's a very, very heavily used little data 
structure. I found it very hard to convince myself I could catch all the places 
locks would be required. And at the same time, I seemed to be able to do 
everything I needed with IW locking.

That said, the code's not done, so ....

        Net/net I'm still thinking we should simplify this API to be
        stateless.  I think there are a number of benefits:

         * We would no longer need to add a new IndexMerger interface that
            adds unecessary complexity to Lucnee (and, make the awkward
            decisions up front on which IndexWriter fields are allowed to be
            visible through the interface).

          * Keep CMP simpler (only top of stack (where I think "macro"
            concurrency should live), not top and bottom).

          * Work correctly as wrapper around other merge policies (ie not hit
            infinite recursion because mergePolicy had naturally assumed that
            "merge" would have changed the segmentInfos)

          * Allows locking on segmentInfos (not IndexWriter), and allows
           concurrency on multiple threads adding docs even without using
           CMP.

Hmmm ... I guess our approaches are pretty different. If you want to take a 
stab at this ...

> Factor merge policy out of IndexWriter
> --------------------------------------
>
>                 Key: LUCENE-847
>                 URL: https://issues.apache.org/jira/browse/LUCENE-847
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Steven Parkes
>            Assignee: Steven Parkes
>         Attachments: concurrentMerge.patch, LUCENE-847.patch.txt, 
> LUCENE-847.patch.txt, LUCENE-847.txt
>
>
> If we factor the merge policy out of IndexWriter, we can make it pluggable, 
> making it possible for apps to choose a custom merge policy and for easier 
> experimenting with merge policy variants.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

[jira] Commented: (LUCENE-847) Factor merge policy out of IndexWriter

Reply via email to