Yonik Seeley wrote:
On Jan 24, 2008 5:47 AM, Michael McCandless
<[EMAIL PROTECTED]> wrote:
Yonik Seeley wrote:
On Jan 23, 2008 6:34 AM, Michael McCandless
<[EMAIL PROTECTED]> wrote:
writer.freezeDocIDs();
try {
  // get docIDs from somewhere & call writer.deleteByDocID
} finally {
  writer.unfreezeDocIDs();
}
Interesting idea, but would require the IndexWriter to flush the
buffered docs so an IndexReader could be created from them (or would
require the existence of an UnflushedDocumentsIndexReader).
True.
Actually, an UnflushedDocumentsIndexReader would not be hard!
DocumentsWriter already has an IndexInput (ByteSliceReader) that can
read the postings for a single term from the RAM buffer (this is used
when flushing the segment). I think it'd be straightforward to get
TermEnum/TermDocs/TermPositions iterators on the buffered docs.
Norms are already stored as byte arrays in memory. FieldInfos is
already available. The stored fields & term vectors are already
flushed to the directory so they could be read normally.
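As a rough, self-contained sketch of that idea (these are not Lucene's actual ByteSliceReader/TermDocs classes; all names here are illustrative): postings for a term sit in RAM as VInt-encoded docID deltas, and a TermDocs-style cursor can decode them in place, without flushing a segment first:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, loosely modeled on how DocumentsWriter's
// ByteSliceReader reads the RAM buffer when flushing a segment.
class RamPostingsSketch {

  // Encode a sorted docID list as VInt-encoded deltas, as Lucene does on disk.
  static byte[] writePostings(int[] docIDs) {
    List<Byte> out = new ArrayList<>();
    int last = 0;
    for (int docID : docIDs) {
      int delta = docID - last;
      last = docID;
      while ((delta & ~0x7F) != 0) {      // more than 7 bits left:
        out.add((byte) ((delta & 0x7F) | 0x80));  // emit low 7, set high bit
        delta >>>= 7;
      }
      out.add((byte) delta);              // final byte, high bit clear
    }
    byte[] bytes = new byte[out.size()];
    for (int i = 0; i < bytes.length; i++) bytes[i] = out.get(i);
    return bytes;
  }

  // A TermDocs-like cursor over the buffered bytes.
  static class BufferedTermDocs {
    private final byte[] buf;
    private int pos;
    private int doc;

    BufferedTermDocs(byte[] buf) { this.buf = buf; }

    // Advance to the next docID; false when the buffer is exhausted.
    boolean next() {
      if (pos >= buf.length) return false;
      int delta = 0, shift = 0;
      byte b;
      do {
        b = buf[pos++];
        delta |= (b & 0x7F) << shift;
        shift += 7;
      } while ((b & 0x80) != 0);
      doc += delta;
      return true;
    }

    int doc() { return doc; }
  }

  static int[] readAll(byte[] buf) {
    List<Integer> docs = new ArrayList<>();
    BufferedTermDocs td = new BufferedTermDocs(buf);
    while (td.next()) docs.add(td.doc());
    return docs.stream().mapToInt(Integer::intValue).toArray();
  }
}
```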
Hmm, buffered delete terms are tricky. I guess freezeDocIDs would
have to flush the buffered delete terms (and queries, if we add those)
before making a reader accessible,
If we buffer queries, that would seem to take care of 99% of the
use cases that need an IndexReader, right? A custom query could get
ids from an index however it wanted.
I think so?
So, if we add only buffered "deleteByQuery" (and setNorm) to
IndexWriter, is that enough to deprecate deleteDocument, setNorm in
IndexReader?
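A hedged toy model of what buffered deleteByQuery on the writer might look like (not the real API; a "query" here is just a predicate over docIDs, and BufferingWriter is an invented name): nothing is resolved at call time, and all buffered queries are run against the index only at flush:

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical sketch: a writer that buffers deleteByQuery calls and
// resolves them to deleted docIDs only when it flushes.
class BufferingWriter {
  private final List<IntPredicate> bufferedDeleteQueries = new ArrayList<>();
  private int maxDoc;  // docs indexed so far

  void addDocument() { maxDoc++; }

  // Buffered: nothing is resolved until flush.
  void deleteByQuery(IntPredicate query) { bufferedDeleteQueries.add(query); }

  // At flush, run each buffered query and mark matching docs deleted.
  BitSet flushDeletes() {
    BitSet deleted = new BitSet(maxDoc);
    for (IntPredicate q : bufferedDeleteQueries)
      for (int doc = 0; doc < maxDoc; doc++)
        if (q.test(doc)) deleted.set(doc);
    bufferedDeleteQueries.clear();
    return deleted;
  }
}
```

Since a custom query can resolve docIDs however it likes, delete-by-docID falls out as the special case `doc -> doc == target`.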
though, the cost is shared because the
readers need to be opened anyway (so the app can find docIDs).
So maybe this approach becomes this:
// Returns a "point in time" frozen view of index...
IndexReader reader = writer.getReader();
try {
  // get docIDs from reader, delete by docID
} finally {
  writer.releaseReader();
}
?
We may even be able to implement this w/o actually freezing the
writer, i.e., still allowing add/updateDocument calls to proceed.
Merging could certainly still proceed. This way you could at any
time ask a writer for a "point in time" reader, independent of
whatever else you are doing with the writer. This would require that,
on flushing, the writer go and swap in a "real" segment reader,
limited to a specified docID, for any point-in-time readers that are
open.
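A minimal sketch of those "point in time" semantics (hypothetical names; the real work of swapping a flushed segment reader in underneath is elided): the reader pins maxDoc at open time, so adds made after getReader() stay invisible to it:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of writer.getReader(): the reader is truncated to
// the doc count recorded when it was opened.
class PointInTimeSketch {
  static class Writer {
    private final List<String> docs = new ArrayList<>();

    void addDocument(String doc) { docs.add(doc); }

    // "Point in time" view: remembers maxDoc now; later adds are invisible.
    Reader getReader() { return new Reader(docs, docs.size()); }
  }

  static class Reader {
    private final List<String> docs;
    private final int maxDoc;  // frozen at open time

    Reader(List<String> docs, int maxDoc) { this.docs = docs; this.maxDoc = maxDoc; }

    int maxDoc() { return maxDoc; }

    String document(int docID) {
      if (docID >= maxDoc) throw new IllegalArgumentException("beyond snapshot");
      return docs.get(docID);
    }
  }
}
```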
Wow... sounds complex.
I think it may not be so bad ... the raw ingredients are already done
(like ByteSliceReader) ... need to ponder it some more.
I think one very powerful side effect of doing this would be that you
could have extremely low latency indexing ("highly interactive
indexing"). You would add/delete docs using the writer, then quickly
re-open the reader, and be able to search the buffered docs without
the cost of flushing a new segment, assuming it's all within one JVM.
This reader (that searches both on-disk segments and the writer's
buffered docs) would do reopen extremely efficiently. In the
[distant?] future, it could even do searching "live", meaning the
full buffer is always searched rather than a point-in-time snapshot.
But we couldn't really do this until we re-work the FieldCache API to
belong to each segment & be incrementally updateable such that if a
new doc is added to the writer, we could efficiently update the
FieldCache, if present. That would be a big change :)
Lots to think through ....
If we went that route, we'd need to expose methods in IndexWriter to
let you get reader(s), and then to delete by docID.
Right... I had envisioned a callback that was called after a new
segment was created/flushed that passed IndexReader[]. In an
environment of mixed deletes and adds, it would avoid slowing down
the indexing part by limiting where the deletes happen.
This would certainly be less work :) I guess the question is how
severely we are limiting the application by requiring that you can
only do deletes when IW decides to flush, or by forcing the
application to flush when it wants to do deletes.
Seems like more work, rather than limiting... "when" really isn't as
important as long as it's before a new external IndexReader is opened
for searching.
Right, but if you want very low latency indexing (or even essentially
0) then you can't really afford to buffer deletes (or adds) for that
long...
It does put a little more burden on the user, but a slightly harder
(but more powerful / more efficient) API is preferable since easier
APIs can always be built on top (but not vice-versa).
True, though emulating the easier API on top of the "you get to
delete only when IW flushes" means you are forcing a flush, right?
I was thinking via buffering (the same way term deletes are handled
now).
You keep track of maxDoc() at the time of the delete and defer it
until later.
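That buffering scheme can be sketched like so (illustrative names, not Lucene code): each deferred delete remembers maxDoc() at the moment it was requested, and at apply time it only touches docIDs below that limit, so docs added afterward are never deleted even if they match:

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical sketch of deferring deletes by recording maxDoc() at
// delete time, the same way buffered delete terms work now.
class DeferredDeletes {
  // A buffered delete: the query plus the doc count when it arrived.
  static class PendingDelete {
    final IntPredicate query;
    final int maxDocAtDelete;
    PendingDelete(IntPredicate query, int maxDocAtDelete) {
      this.query = query;
      this.maxDocAtDelete = maxDocAtDelete;
    }
  }

  private final List<PendingDelete> pending = new ArrayList<>();
  private int maxDoc;

  void addDocument() { maxDoc++; }

  void deleteByQuery(IntPredicate query) {
    pending.add(new PendingDelete(query, maxDoc));  // defer until flush
  }

  BitSet applyDeletes() {
    BitSet deleted = new BitSet(maxDoc);
    for (PendingDelete d : pending)
      for (int doc = 0; doc < d.maxDocAtDelete; doc++)  // honor the recorded limit
        if (d.query.test(doc)) deleted.set(doc);
    pending.clear();
    return deleted;
  }
}
```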
Oh, right, OK.
Mike