[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*

Shai Erera (JIRA) Sun, 16 May 2010 23:52:10 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868103#action_12868103
 ]


Shai Erera commented on LUCENE-2455:
------------------------------------

So ... after I slept over it, I don't think I can easily let go of the so-near 
victory -- having addIndexes not blocking anything, done on the side, be as 
efficient as possible and give up a chance to do some serious code cleanup :). 
So I'd like to propose the following, we can discuss them one by one, but they 
all aim at the same thing:
# addIndexes(Directory ...) will be a quick & dirty copy of segments. No 
deleted docs removal in the process, no merges.
#* It's clear how this can be done for segments that don't share doc stores.
#* What isn't clear to me is why can't this work for segments that do share doc 
stores - can't we copy all of them at once, and add to segmentInfos when the 
segments + their doc store were copied? So, if I have 5 segments, w/ the 
following doc stores: 1-2, 3 and 4-5 (3 stores for 5 segments), can't I copy 
1+2+store, 3, 4+5+store? Wouldn't that work? I'll give it a shot.
# PayloadProcessorProvider won't be used in the process. If you want, you can 
set it on the source writer, and call optimize(). Performance-wise it will be 
nearly identical - instead of processing the postings + payloads during 
addIndexes, you'll do that during optimize() (all segments will be processed 
anyway) and addIndexes will do a bulk IO copy, which on most modern machines is 
really cheap.
#* Further more, you will end up w/ just one segment, which means it can be 
copied at once for sure.
#* It will also simplify PPP -- no need for DirPP anymore. PPP would get a Term 
and return a PP for that term, as it will always work on just one Directory.
#* People can still use it with the target IW if they want, but not through 
addIndexes.
# Apps can call maybeMerge or optimize() following addIndexes, if they want to.

What remains is addIndexes(IndexReader...) and I'm not sure why this cannot be 
removed. In the back of my head I remember a discussion about it once, so I 
guess there is a good reason. But at least from what I see now, and armed w/ my 
use cases only, it seems like even if you use an extension of IndexReader you 
should still be able to do a bulk copy? Hmm ... unless if your extension 
assumes different postings structure or something like that, which the regular 
SegmentReader won't know about -- then during addIndexes those postings are 
converted.

But, how common is this? And wouldn't it be better if such indexes are migrated 
beforehand? I mean, anyway after addIndexes those postings won't retain the 
custom IndexReader-assuming format? Or is there another reason?

If we go with that, then SegmentMerger can be simplified as well, assuming only 
SegmentReader?

What do you think?

> Some house cleaning in addIndexes*
> ----------------------------------
>
>                 Key: LUCENE-2455
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2455
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Shai Erera
>            Assignee: Shai Erera
>            Priority: Trivial
>             Fix For: 3.1, 4.0
>
>         Attachments: LUCENE-2455_3x.patch
>
>
> Today, the use of addIndexes and addIndexesNoOptimize is confusing - 
> especially on when to invoke each. Also, addIndexes calls optimize() in 
> the beginning, but only on the target index. It also includes the 
> following jdoc statement, which from how I understand the code, is 
> wrong: _After this completes, the index is optimized._ -- optimize() is 
> called in the beginning and not in the end. 
> On the other hand, addIndexesNoOptimize does not call optimize(), and 
> relies on the MergeScheduler and MergePolicy to handle the merges. 
> After a short discussion about that on the list (Thanks Mike for the 
> clarifications!) I understand that there are really two core differences 
> between the two: 
> * addIndexes supports IndexReader extensions
> * addIndexesNoOptimize performs better
> This issue proposes the following:
> # Clear up the documentation of each, spelling out the pros/cons of 
>   calling them clearly in the javadocs.
> # Rename addIndexesNoOptimize to addIndexes
> # Remove optimize() call from addIndexes(IndexReader...)
> # Document that clearly in both, w/ a recommendation to call optimize() 
>   before on any of the Directories/Indexes if it's a concern. 
> That way, we maintain all the flexibility in the API - 
> addIndexes(IndexReader...) allows for using IR extensions, 
> addIndexes(Directory...) is considered more efficient, by allowing the 
> merges to happen concurrently (depending on MS) and also factors in the 
> MP. So unless you have an IR extension, addDirectories is really the one 
> you should be using. And you have the freedom to call optimize() before 
> each if you care about it, or don't if you don't care. Either way, 
> incurring the cost of optimize() is entirely in the user's hands. 
> BTW, addIndexes(IndexReader...) does not use neither the MergeScheduler 
> nor MergePolicy, but rather call SegmentMerger directly. This might be 
> another place for improvement. I'll look into it, and if it's not too 
> complicated, I may cover it by this issue as well. If you have any hints 
> that can give me a good head start on that, please don't be shy :). 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*

Reply via email to