[jira] Commented: (LUCENE-2026) Refactoring of IndexWriter

Michael McCandless (JIRA) Thu, 10 Dec 2009 11:32:42 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788856#action_12788856
 ]


Michael McCandless commented on LUCENE-2026:
--------------------------------------------

I think what you're describing is in fact the approach that LUCENE-1313 is 
taking; it's doing the switching internally between the main Dir & a private 
RAM Dir.

But in my testing so far (LUCENE-2061), it doesn't seem like it'll help 
performance much.  I.e., the OS generally seems to do a fine job of keeping 
those segments in RAM itself, by maintaining a write cache.  The odd part is 
that this only holds true if you flush the segments when they are tiny (once 
per second, every 100 docs, in my test); I'm not yet sure why that's the case.  
I'm going to re-run the perf tests on a more mainstream OS (my tests are all 
on OpenSolaris) and see whether that strangeness still happens.

But I think you still need to avoid calling commit() during the reopen.

I do think that refactoring IW so that there is a separate component that 
keeps track of the segments in the index may simplify NRT, in that you can go 
to that source for your current "segments file" even if that segments file is 
uncommitted.  In such a world you could do something like 
IndexReader.open(SegmentState) and it would be able to open (and reopen) the 
real-time reader.  The reader would simply see the changes the writer makes to 
the SegmentState, even if they're not yet committed.
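
A minimal sketch of that idea, for illustration only.  SegmentState here is a 
hypothetical class (nothing like it exists in Lucene today), SnapshotReader is 
a stand-in for whatever IndexReader.open(SegmentState) would return, and 
segments are reduced to plain names.  It only shows the visibility rule: a 
reader opened from the shared state sees flushed-but-uncommitted segments, 
and reopen() returns the same reader when nothing has changed.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical shared view of the index's segments; not a real Lucene class. */
final class SegmentState {
    // Everything the writer has flushed so far, committed or not.
    private List<String> current = Collections.emptyList();
    // What the last committed "segments file" would list.
    private List<String> committed = Collections.emptyList();
    private long version = 0;

    /** Writer side: publish a newly flushed segment without committing it. */
    synchronized void addFlushedSegment(String name) {
        List<String> next = new ArrayList<>(current);
        next.add(name);
        current = Collections.unmodifiableList(next);
        version++;
    }

    /** Writer side: make the current segment list the committed one. */
    synchronized void commit() {
        committed = current;
    }

    synchronized List<String> currentSegments() { return current; }
    synchronized List<String> committedSegments() { return committed; }
    synchronized long version() { return version; }
}

/** Hypothetical stand-in for a reader opened via IndexReader.open(SegmentState). */
final class SnapshotReader {
    private final SegmentState state;
    private final List<String> segments;
    private final long version;

    private SnapshotReader(SegmentState state, List<String> segments, long version) {
        this.state = state;
        this.segments = segments;
        this.version = version;
    }

    /** Opens a point-in-time view that includes uncommitted segments. */
    static SnapshotReader open(SegmentState state) {
        synchronized (state) {
            return new SnapshotReader(state, state.currentSegments(), state.version());
        }
    }

    /** Returns this reader if the state is unchanged, else a fresh snapshot. */
    SnapshotReader reopen() {
        synchronized (state) {
            return state.version() == version ? this : open(state);
        }
    }

    List<String> segments() { return segments; }
}
```

Note that no commit() is involved in open() or reopen(); the writer only has 
to publish each flushed segment to the shared state.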

> Refactoring of IndexWriter
> --------------------------
>
>                 Key: LUCENE-2026
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2026
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 3.1
>
>
> I've been thinking for a while about refactoring the IndexWriter into
> two main components.
> One could be called a SegmentWriter and as the
> name says its job would be to write one particular index segment. The
> default one, just as today, would provide methods to add documents and
> flush when its buffer is full.
> Other SegmentWriter implementations would do things like appending or
> copying external segments [what addIndexes*() currently does].
> The second component's job would be to manage writing the segments
> file and merging/deleting segments. It would know about
> DeletionPolicy, MergePolicy and MergeScheduler. Ideally it would
> provide hooks that allow users to manage external data structures and
> keep them in sync with Lucene's data during segment merges.
> API-wise there are things we have to figure out, such as where the
> updateDocument() method would fit in, because its deletion part
> affects all segments, whereas the new document is only being added to
> the new segment.
> Of course these should be lower level APIs for things like parallel
> indexing and related use cases. That's why we should still provide
> easy to use APIs like today for people who don't need to care about
> per-segment ops during indexing. So the current IndexWriter could
> probably keep most of its APIs and delegate to the new classes.
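
The two-component split in the issue description above could be sketched 
roughly as follows.  This is an illustration under stated assumptions, not the 
proposed design: only the name SegmentWriter comes from the description, while 
Segment, SegmentManager, BufferingSegmentWriter, maxBufferedDocs and 
mergeFactor are hypothetical, documents are plain strings, and the "merge 
policy" is simply "merge everything once mergeFactor segments exist".  
Deletes, updateDocument() and the segments file are left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical in-memory segment: a name plus its documents. */
final class Segment {
    final String name;
    final List<String> docs;

    Segment(String name, List<String> docs) {
        this.name = name;
        this.docs = Collections.unmodifiableList(docs);
    }
}

/** First component: writes one particular segment at a time. */
interface SegmentWriter {
    void addDocument(String doc);
    void flush();
}

/** Hypothetical second component: tracks segments and merges them. */
final class SegmentManager {
    private final List<Segment> segments = new ArrayList<>();
    private final int mergeFactor;
    private int counter = 0;

    SegmentManager(int mergeFactor) { this.mergeFactor = mergeFactor; }

    String nextName() { return "_" + (counter++); }

    /** Called by a SegmentWriter whenever it finishes a segment. */
    void register(Segment segment) {
        segments.add(segment);
        maybeMerge();
    }

    // Toy merge policy: once mergeFactor segments exist, merge them all into one.
    private void maybeMerge() {
        if (segments.size() < mergeFactor) return;
        List<String> merged = new ArrayList<>();
        for (Segment s : segments) merged.addAll(s.docs);
        segments.clear();
        segments.add(new Segment(nextName(), merged));
    }

    List<Segment> segments() { return Collections.unmodifiableList(segments); }
}

/** Default writer: buffers documents and flushes when the buffer is full. */
final class BufferingSegmentWriter implements SegmentWriter {
    private final SegmentManager manager;
    private final int maxBufferedDocs;
    private List<String> buffer = new ArrayList<>();

    BufferingSegmentWriter(SegmentManager manager, int maxBufferedDocs) {
        this.manager = manager;
        this.maxBufferedDocs = maxBufferedDocs;
    }

    @Override
    public void addDocument(String doc) {
        buffer.add(doc);
        if (buffer.size() >= maxBufferedDocs) flush();
    }

    @Override
    public void flush() {
        if (buffer.isEmpty()) return;
        manager.register(new Segment(manager.nextName(), buffer));
        buffer = new ArrayList<>(); // start a fresh buffer for the next segment
    }
}
```

An IndexWriter-style facade could keep its current addDocument() API and 
delegate to a BufferingSegmentWriter, while other SegmentWriter 
implementations hand externally built segments to the same SegmentManager.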

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
