[jira] Commented: (LUCENE-2047) IndexWriter should immediately resolve deleted docs to docID in near-real-time mode

Michael McCandless (JIRA) Fri, 13 Nov 2009 02:33:17 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777436#action_12777436
 ]


Michael McCandless commented on LUCENE-2047:
--------------------------------------------

{quote}
we'd need to
prevent the deletion of the SR's files we're deleting from, even
if that SR is no longer live. 
{quote}

It's strange that anything here is needed, because when you check a
reader out from the pool, it's incRef'd, which should mean the files
need no protection.  Something strange is up... could it be that when
you check out that reader to do deletions, it wasn't already open, and
then on trying to open it, its files were already deleted?  (In which
case, that segment has been merged away, and the merge has committed,
ie already carried over all deletes, and so you should instead be
deleting against the merged segment.)
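
To make the ref-counting argument concrete, here is a toy model (not
Lucene's actual code; PooledReader, ReaderPool and their methods are
hypothetical names, and file deletion is reduced to a boolean flag).
It only illustrates why a reader that was incRef'd at checkout should
keep its files alive even after the pool drops the segment:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a pooled SegmentReader; not Lucene's class.
final class PooledReader {
    final String segmentName;
    private int refCount = 1;             // the pool's own reference
    private boolean filesDeleted = false; // toy flag instead of real file deletion

    PooledReader(String segmentName) {
        this.segmentName = segmentName;
    }

    synchronized void incRef() {
        if (refCount <= 0) {
            throw new IllegalStateException("reader already closed: " + segmentName);
        }
        refCount++;
    }

    synchronized void decRef() {
        if (refCount <= 0) {
            throw new IllegalStateException("too many decRef calls: " + segmentName);
        }
        refCount--;
        if (refCount == 0) {
            // In this model, files become deletable only once the last ref is released.
            filesDeleted = true;
        }
    }

    synchronized boolean filesDeleted() {
        return filesDeleted;
    }
}

// Hypothetical pool keyed by segment name.
final class ReaderPool {
    private final Map<String, PooledReader> readers = new HashMap<>();

    synchronized void add(String segmentName) {
        readers.put(segmentName, new PooledReader(segmentName));
    }

    // Check out a reader; the caller gets its own reference.
    // Returns null if the segment is no longer pooled (eg merged away).
    synchronized PooledReader checkout(String segmentName) {
        PooledReader reader = readers.get(segmentName);
        if (reader != null) {
            reader.incRef();
        }
        return reader;
    }

    // Drop the pool's reference, as when a merge commits and the segment goes away.
    synchronized void drop(String segmentName) {
        PooledReader reader = readers.remove(segmentName);
        if (reader != null) {
            reader.decRef();
        }
    }
}
```

In this model a reader checked out before drop() keeps its files until
its own decRef(), while a checkout after drop() returns null, which
corresponds to the "deleting against the merged segment" case.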

So I think the sync(IW) is in fact necessary?  Note that the current
approach (deferring resolving term -> docIDs until flush time) also
sync(IW)'d, so we're not really changing that, here.  Though I agree
it would be nice to not have to sync(IW).  Really what we need to sync
on is "any merge that is merging this segment away and now wants to
commit".  That's actually a very narrow event so someday (separate
issue) if we could refine the sync'ing to that, it should be a good
net throughput improvement for updateDocument.

{quote}
What happens to
documents that need to be deleted but are still in the RAM
buffer?
{quote}

Ahh, yes.  We must still buffer for this case, and resolve these
deletes against the newly flushed segment.  I think we need a separate
buffer that tracks pending delete terms only against the RAM buffer?
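
A rough sketch of what such a separate buffer could look like, under
some loud assumptions: each buffered doc carries a single id term,
terms are plain Strings, and RamBufferDeletes with its methods is a
hypothetical name, not an existing Lucene class.  Each delete term
records how many docs were buffered when the delete arrived, so that at
flush it only applies to docs added before it:

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical buffer of delete terms that apply only to docs still in RAM.
final class RamBufferDeletes {
    // Toy RAM buffer: the id term of each buffered doc, indexed by docID.
    private final List<String> bufferedDocTerms = new ArrayList<>();
    // Delete term -> number of docs buffered when the delete was issued.
    private final Map<String, Integer> pendingTerms = new HashMap<>();

    void addDocument(String idTerm) {
        bufferedDocTerms.add(idTerm);
    }

    void deleteTerm(String term) {
        // A later delete of the same term overwrites the limit with a larger one.
        pendingTerms.put(term, bufferedDocTerms.size());
    }

    void updateDocument(String idTerm) {
        deleteTerm(idTerm);   // delete older docs with this term
        addDocument(idTerm);  // the new doc lies past the recorded limit
    }

    // Resolve pending terms against the newly flushed segment and return
    // the deleted docIDs of that segment; both buffers are then cleared.
    BitSet flush() {
        int numDocs = bufferedDocTerms.size();
        BitSet deleted = new BitSet(numDocs);
        for (int docID = 0; docID < numDocs; docID++) {
            Integer docIDUpto = pendingTerms.get(bufferedDocTerms.get(docID));
            if (docIDUpto != null && docID < docIDUpto) {
                deleted.set(docID);
            }
        }
        bufferedDocTerms.clear();
        pendingTerms.clear();
        return deleted;
    }
}
```

With addDocument("a"), addDocument("b"), updateDocument("a"), flush()
marks only docID 0 as deleted; the replacement doc at docID 2 survives
because it was added after the delete.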

Also, instead of actually setting the bits in SR's deletedDocs, I
think you should buffer the deleted docIDs into DW's
deletesInRAM.docIDs?  Ie, we do the resolution of Term/Query -> docID,
but buffer the docIDs we resolved to.  This is necessary for
correctness in exceptional situations.  Eg, if you do a bunch of
updateDocuments and then DW hits an aborting exception (meaning its RAM
buffer may be corrupt), DW currently discards the RAM buffer but
leaves previously flushed segments intact, so that if you then commit,
you have a consistent index.  Ie, in that situation, we don't want the
docs deleted by updateDocument calls to be committed to the index, so
we need to buffer them.
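
A minimal sketch of that idea, with made-up names (BufferedDocIdDeletes
is hypothetical; a BitSet stands in for SR's deletedDocs and a List for
deletesInRAM.docIDs).  Resolved docIDs only reach the bits on a
successful flush, and an abort throws them away:

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical holder for docIDs that were resolved but not yet applied.
final class BufferedDocIdDeletes {
    private final BitSet deletedDocs;                               // stand-in for SR's deletedDocs
    private final List<Integer> pendingDocIDs = new ArrayList<>();  // stand-in for deletesInRAM.docIDs

    BufferedDocIdDeletes(int maxDoc) {
        this.deletedDocs = new BitSet(maxDoc);
    }

    // Called after Term/Query -> docID resolution; does not touch the bits yet.
    void bufferDelete(int docID) {
        pendingDocIDs.add(docID);
    }

    // Successful flush: apply the buffered docIDs, then clear the buffer.
    void flush() {
        for (int docID : pendingDocIDs) {
            deletedDocs.set(docID);
        }
        pendingDocIDs.clear();
    }

    // Aborting exception: drop buffered deletes so they are never committed.
    void abort() {
        pendingDocIDs.clear();
    }

    boolean isDeleted(int docID) {
        return deletedDocs.get(docID);
    }

    int pendingCount() {
        return pendingDocIDs.size();
    }
}
```

If bufferDelete(3) is followed by abort() and then flush(), doc 3 stays
live, which is the behavior wanted when the matching added docs were
lost with the RAM buffer.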


> IndexWriter should immediately resolve deleted docs to docID in 
> near-real-time mode
> -----------------------------------------------------------------------------------
>
>                 Key: LUCENE-2047
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2047
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-2047.patch, LUCENE-2047.patch
>
>
> Spinoff from LUCENE-1526.
> When deleteDocuments(Term) is called, we currently always buffer the
> Term and only later, when it's time to flush deletes, resolve to
> docIDs.  This is necessary because we don't in general hold
> SegmentReaders open.
> But, when IndexWriter is in NRT mode, we pool the readers, and so
> deleting in the foreground is possible.
> It's also beneficial, in that it can reduce the turnaround time when
> reopening a new NRT reader by taking this resolution off the reopen
> path.  And if multiple threads are used to do the deletion, then we
> gain concurrency, vs reopen which is not concurrent when flushing the
> deletes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
