[jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*

Shai Erera (JIRA) Thu, 20 May 2010 04:09:20 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12869563#action_12869563
 ]


Shai Erera commented on LUCENE-2455:
------------------------------------

I've started implementing addIndexes(Directory...) as agreed: copy the files 
from the incoming directories into the local directory, renaming them on the 
fly. This works really well with non-CFS segments: a new segment name is 
generated, the incoming files are renamed, and it all goes smoothly (I haven't 
tested with deletions yet). Even shared doc stores work great.
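
To make the renaming step concrete, here is a minimal sketch of how incoming 
file names could be mapped onto a newly generated segment name. It is not the 
code from the patch: SegmentRenameSketch, rename and renameAll are hypothetical 
names, and the sketch assumes every file of a segment starts with the segment 
name followed by '.' or '_'.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Lucene: maps the file names of an incoming
// segment onto a newly generated segment name.
final class SegmentRenameSketch {

    /** ("_2.fdt", "_2", "_5") -> "_5.fdt"; ("_2_1.del", "_2", "_5") -> "_5_1.del". */
    static String rename(String fileName, String oldSegment, String newSegment) {
        String rest = fileName.startsWith(oldSegment)
                ? fileName.substring(oldSegment.length())
                : "";
        // Assumption: the segment name is followed by '.' or '_'. This keeps
        // "_2" from accidentally matching a file of segment "_21".
        if (rest.isEmpty() || (rest.charAt(0) != '.' && rest.charAt(0) != '_')) {
            throw new IllegalArgumentException(
                    fileName + " does not belong to segment " + oldSegment);
        }
        return newSegment + rest;
    }

    /** Renames every file of one segment, preserving order. */
    static List<String> renameAll(List<String> fileNames, String oldSegment, String newSegment) {
        List<String> renamed = new ArrayList<>();
        for (String name : fileNames) {
            renamed.add(rename(name, oldSegment, newSegment));
        }
        return renamed;
    }

    public static void main(String[] args) {
        List<String> incoming = Arrays.asList("_2.fdt", "_2.fdx", "_2_1.del");
        System.out.println(renameAll(incoming, "_2", "_5"));
    }
}
```

The actual copy would go through the Directory API; only the name mapping is 
shown here.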

But with CFS it doesn't work well, because the CFS writes the names of the 
files it holds into the CFS file itself. So even if the segment is renamed to 
_5 (for example), the names written in the file are still _2.* (for example), 
and openInput fails to locate them. To overcome this, I propose we do the 
following:

* Introduce a stripName method on IndexFileNames (IFN) (3x and trunk) - it will 
return the file name w/o the _x part.
* CompoundFileReader (CFR) ctor - strip the names of the entries it reads by 
calling IFN.stripName --> 3x only
* CFR.openInput - strip the requested name by calling IFN.stripName --> 3x and 
trunk
* Document that file names should be created through IFN only --> 3x (for 
clarity) and trunk (otherwise they may not be supported).
* Don't save the name in the CFS --> trunk only. This will remove the need to 
strip it off when it's read.
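
The proposal above can be sketched roughly as follows. This is only an 
illustration under assumptions, not the actual IndexFileNames or 
CompoundFileReader code: StripNameSketch and ToyCompoundReader are hypothetical 
stand-ins, the entry table is a plain map of name to offset, and stripName 
assumes the _x part ends at the first '.' or '_' after the leading underscore.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not Lucene code.
final class StripNameSketch {

    /** Drops the leading "_x" part: "_2.fdt" -> ".fdt", "_2_1.del" -> "_1.del". */
    static String stripName(String fileName) {
        if (!fileName.startsWith("_")) {
            return fileName; // not a per-segment file name, leave untouched
        }
        for (int i = 1; i < fileName.length(); i++) {
            char c = fileName.charAt(i);
            if (c == '.' || c == '_') {
                return fileName.substring(i);
            }
        }
        return fileName; // no separator found, nothing to strip
    }

    /** Toy stand-in for a compound-file reader: entries are keyed by stripped name. */
    static final class ToyCompoundReader {
        private final Map<String, Long> offsets = new HashMap<>();

        /** storedEntries: names as written in the CFS (e.g. "_2.fdt") -> offset. */
        ToyCompoundReader(Map<String, Long> storedEntries) {
            for (Map.Entry<String, Long> e : storedEntries.entrySet()) {
                offsets.put(stripName(e.getKey()), e.getValue());
            }
        }

        /** Looks up a sub-file by name, ignoring the segment part of the name. */
        long offsetOf(String fileName) {
            Long offset = offsets.get(stripName(fileName));
            if (offset == null) {
                throw new IllegalArgumentException("No sub-file found: " + fileName);
            }
            return offset;
        }
    }

    public static void main(String[] args) {
        Map<String, Long> stored = new HashMap<>();
        stored.put("_2.fdt", 0L);
        stored.put("_2.fdx", 128L);
        // The segment was renamed to _5, but the CFS still lists _2.* names.
        ToyCompoundReader reader = new ToyCompoundReader(stored);
        System.out.println(reader.offsetOf("_5.fdx"));
    }
}
```

Because both the stored names and the requested name are stripped, the lookup 
no longer depends on which segment name the CFS was originally written under.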

That will ensure that files are named following a certain convention which we 
can rely on in CFR. I don't think that's too much to ask. The CFS already 
knows the segment name, because the CFS file itself is named after it. So 
there's no value in storing the names of the files it holds.

For 3x it should work well b/c we don't allow custom index files. For trunk 
we'll ask people to go through IFN to name files - so one can create 
mycustom.file through IFN, and it will be named _x_mycustom.file.
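
As a rough illustration of the naming convention for custom files, the sketch 
below builds a segment-prefixed name and checks whether a name follows the 
convention. It is not the IFN API: CustomFileNameSketch, customFileName and 
followsConvention are hypothetical, and the check assumes segment names are an 
underscore followed by lowercase base-36 characters.

```java
// Hypothetical sketch, not Lucene code.
final class CustomFileNameSketch {

    /** Builds a segment-prefixed name: ("_x", "mycustom.file") -> "_x_mycustom.file". */
    static String customFileName(String segment, String localName) {
        return segment + "_" + localName;
    }

    /**
     * True if the name starts with a segment part ("_" plus base-36 characters,
     * an assumption of this sketch) followed by '.' or '_' and the rest of the name.
     */
    static boolean followsConvention(String fileName) {
        return fileName.matches("_[0-9a-z]+[._].+");
    }

    public static void main(String[] args) {
        String name = customFileName("_x", "mycustom.file");
        System.out.println(name + " follows convention: " + followsConvention(name));
        System.out.println("mycustom.file follows convention: "
                + followsConvention("mycustom.file"));
    }
}
```

A file created without the segment prefix has no _x part to strip, which is why 
such names may not be supported on trunk.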

What do you think?

> Some house cleaning in addIndexes*
> ----------------------------------
>
>                 Key: LUCENE-2455
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2455
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Shai Erera
>            Assignee: Shai Erera
>            Priority: Trivial
>             Fix For: 3.1, 4.0
>
>         Attachments: LUCENE-2455_3x.patch
>
>
> Today, the use of addIndexes and addIndexesNoOptimize is confusing, 
> especially regarding when to invoke each. Also, addIndexes calls optimize() 
> at the beginning, but only on the target index. It also includes the 
> following jdoc statement which, from how I understand the code, is 
> wrong: _After this completes, the index is optimized._ -- optimize() is 
> called at the beginning and not at the end. 
> On the other hand, addIndexesNoOptimize does not call optimize(), and 
> relies on the MergeScheduler and MergePolicy to handle the merges. 
> After a short discussion about that on the list (Thanks Mike for the 
> clarifications!) I understand that there are really two core differences 
> between the two: 
> * addIndexes supports IndexReader extensions
> * addIndexesNoOptimize performs better
> This issue proposes the following:
> # Clear up the documentation of each, spelling out the pros/cons of 
>   calling them clearly in the javadocs.
> # Rename addIndexesNoOptimize to addIndexes
> # Remove optimize() call from addIndexes(IndexReader...)
> # Document that clearly in both, w/ a recommendation to call optimize() 
>   beforehand on any of the Directories/Indexes if it's a concern. 
> That way, we maintain all the flexibility in the API - 
> addIndexes(IndexReader...) allows for using IR extensions, 
> addIndexes(Directory...) is considered more efficient, by allowing the 
> merges to happen concurrently (depending on MS) and also factors in the 
> MP. So unless you have an IR extension, addIndexes(Directory...) is really the one 
> you should be using. And you have the freedom to call optimize() before 
> each if you care about it, or don't if you don't care. Either way, 
> incurring the cost of optimize() is entirely in the user's hands. 
> BTW, addIndexes(IndexReader...) uses neither the MergeScheduler nor the 
> MergePolicy, but rather calls SegmentMerger directly. This might be 
> another place for improvement. I'll look into it, and if it's not too 
> complicated, I may cover it by this issue as well. If you have any hints 
> that can give me a good head start on that, please don't be shy :). 
