[ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791549#action_12791549 ]
Marvin Humphrey commented on LUCENE-2026:
-----------------------------------------

>> Wasn't that a possibility under autocommit as well? All it takes is for the
>> OS to finish flushing the new snapshot file to persistent storage before it
>> finishes flushing a segment data file needed by that snapshot, and for the
>> power failure to squeeze in between.
>
> Not after LUCENE-1044... autoCommit simply called commit() at certain
> opportune times (after finishing big merges), which does the right thing (I
> hope!). The segments file is not written until all files it references are
> sync'd.

FWIW, autoCommit doesn't really have a place in Lucy's
one-segment-per-indexing-session model.

Revisiting the LUCENE-1044 threads, one passage stood out:

{panel}
http://www.gossamer-threads.com/lists/lucene/java-dev/54321#54321

This is why in a db system, the only file that is sync'd is the log file --
all other files can be made "in sync" from the log file -- and this file is
normally striped for optimum write performance. Some systems have special
"log file drives" (some even solid state, or battery backed ram) to aid the
performance.
{panel}

The fact that we have to sync all files instead of just one seems
sub-optimal. Yet Lucene is not well set up to maintain a transaction log:
the very act of adding a document to Lucene is inherently lossy even if all
fields are stored, because doc boost is not preserved.

> Also, having the app explicitly decouple these two notions keeps the
> door open for future improvements. If we force absolutely all sharing
> to go through the filesystem then that limits the improvements we can
> make to NRT.

However, Lucy has much more to gain from going through the file system than
Lucene does, because we don't necessarily incur JVM startup costs when
launching a new process. The Lucene approach to NRT -- a specialized reader
hanging off of the writer -- is constrained to a single process. The Lucy
approach -- fast index opens enabled by mmap-friendly index formats -- is not.

The two approaches aren't mutually exclusive. It will be possible to augment
Lucy with a specialized index reader within a single process. However, A)
there seems to be a lot of disagreement about just how to integrate that
reader, and B) there seem to be ways to bolt that functionality on top of the
existing classes. Under those circumstances, I think it makes more sense to
keep that feature external for now.

> Alternatively, you could keep the notion "flush" (an unsafe commit)
> alive? You write the segments file, but make no effort to ensure its
> durability (and also preserve the last "true" commit). Then a normal
> IR.reopen suffices...

That sounds promising. The semantics would differ from those of Lucene's
flush(), which doesn't make changes visible.

We could implement this by marking a "committed" snapshot and a "flushed"
snapshot differently: either by adding an "fsync" property to the snapshot
file that would be false after a flush() but true after a commit(), or by
encoding the property within the snapshot filename. The file purger would
have to ensure that all index files referenced by either the last committed
snapshot or the last flushed snapshot were off limits. A rollback() would
zap all changes since the last commit().

Such a scheme allows the top-level app to avoid the costs of fsync while
maintaining its own transaction log -- perhaps with the optimizations
suggested above (separate disk, SSD, etc.).
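To make the flushed-vs.-committed distinction concrete, here is a minimal sketch in Java. The names (Snapshot, SnapshotManager, protectedFiles, and the placeholder fsync helper) are hypothetical, not part of any existing Lucy or Lucene API; it assumes the purger simply consults a set of protected file names.

{code:java}
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch only -- not Lucy's or Lucene's actual API.
// flush() publishes a snapshot without fsync; commit() syncs the referenced
// files first; the purger protects files reachable from either the last
// flushed or the last committed snapshot.
class Snapshot {
    final List<String> files;   // index files this snapshot references
    final boolean fsynced;      // false after flush(), true after commit()

    Snapshot(List<String> files, boolean fsynced) {
        this.files = files;
        this.fsynced = fsynced;
    }
}

class SnapshotManager {
    private Snapshot lastCommitted;  // durable state
    private Snapshot lastFlushed;    // visible to reopening readers, not yet durable

    /** Unsafe commit: make changes visible, but skip fsync. */
    void flush(List<String> files) {
        lastFlushed = new Snapshot(files, false);
    }

    /** Safe commit: fsync every referenced file, then mark the snapshot durable. */
    void commit(List<String> files) {
        for (String file : files) {
            fsync(file);             // placeholder, e.g. FileChannel.force(true)
        }
        lastCommitted = new Snapshot(files, true);
        lastFlushed = lastCommitted;
    }

    /** Zap all changes since the last commit(). */
    void rollback() {
        lastFlushed = lastCommitted;
    }

    /** Files the purger must treat as off limits. */
    Set<String> protectedFiles() {
        Set<String> keep = new HashSet<String>();
        if (lastCommitted != null) keep.addAll(lastCommitted.files);
        if (lastFlushed != null) keep.addAll(lastFlushed.files);
        return keep;
    }

    private void fsync(String file) {
        // left abstract in this sketch
    }
}
{code}

Under such a scheme, an app maintaining its own transaction log could call flush() after every batch and reserve commit() for checkpoints.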
> Refactoring of IndexWriter
> --------------------------
>
>                 Key: LUCENE-2026
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2026
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 3.1
>
>
> I've been thinking for a while about refactoring the IndexWriter into
> two main components.
>
> One could be called a SegmentWriter and, as the name says, its job would
> be to write one particular index segment. The default one, just as today,
> would provide methods to add documents and would flush when its buffer is
> full. Other SegmentWriter implementations would do things like appending
> or copying external segments [what addIndexes*() currently does].
>
> The second component's job would be to manage writing the segments file
> and merging/deleting segments. It would know about DeletionPolicy,
> MergePolicy and MergeScheduler. Ideally it would provide hooks that allow
> users to manage external data structures and keep them in sync with
> Lucene's data during segment merges.
>
> API-wise there are things we have to figure out, such as where the
> updateDocument() method would fit in, because its deletion part affects
> all segments, whereas the new document is only being added to the new
> segment.
>
> Of course these should be lower-level APIs for things like parallel
> indexing and related use cases. That's why we should still provide
> easy-to-use APIs like today for people who don't need to care about
> per-segment ops during indexing. So the current IndexWriter could
> probably keep most of its APIs and delegate to the new classes.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
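As an aside, a minimal hypothetical sketch of the two-component split described in the issue above; the interface names SegmentWriter and SegmentManager and all method signatures are assumptions made for illustration, not existing Lucene APIs.

{code:java}
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.Term;

/** Writes one particular index segment (hypothetical interface). */
interface SegmentWriter {
    /** Buffer a document; flush a new segment when the buffer is full. */
    void addDocument(Document doc) throws IOException;

    /** Write any buffered documents as a segment and return its name. */
    String flushSegment() throws IOException;
}

/** Manages the segments file and merging/deleting of segments (hypothetical). */
interface SegmentManager {
    /** Record a newly flushed (or copied external) segment in the segments file. */
    void registerSegment(String segmentName) throws IOException;

    /** The delete half of updateDocument(): affects all segments. */
    void deleteDocuments(Term term) throws IOException;

    /** Run merges according to the configured MergePolicy/MergeScheduler. */
    void maybeMerge() throws IOException;

    /** Hook for keeping external data structures in sync during merges. */
    void addMergeListener(Runnable onMergeFinished);
}
{code}

The existing IndexWriter could then keep its current API and delegate addDocument()/updateDocument() to components along these lines, as the description suggests.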