Yonik Seeley wrote:
On Jan 24, 2008 5:47 AM, Michael McCandless
<[EMAIL PROTECTED]> wrote:
Yonik Seeley wrote:
On Jan 23, 2008 6:34 AM, Michael McCandless
<[EMAIL PROTECTED]> wrote:
writer.freezeDocIDs();
try {
  // get docIDs from somewhere & call writer.deleteByDocID
} finally {
  writer.unfreezeDocIDs();
}
Interesting idea, but would require the IndexWriter to flush the
buffered docs so an IndexReader could be created from them (or would
require the existence of an UnflushedDocumentsIndexReader).
True.
Actually, an UnflushedDocumentsIndexReader would not be hard!
DocumentsWriter already has an IndexInput (ByteSliceReader) that can
read the postings for a single term from the RAM buffer (this is used
when flushing the segment). I think it'd be straightforward to get
TermEnum/TermDocs/TermPositions iterators on the buffered docs.
Norms are already stored as byte arrays in memory. FieldInfos is
already available. The stored fields & term vectors are already
flushed to the directory so they could be read normally.
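As a rough, self-contained sketch of that idea (these are not Lucene's actual ByteSliceReader/TermDocs classes; all names here are illustrative): postings for a term sit in RAM as VInt-encoded docID deltas, and a TermDocs-style cursor can decode them in place, without flushing a segment first:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, loosely modeled on how DocumentsWriter's
// ByteSliceReader reads the RAM buffer when flushing a segment.
class RamPostingsSketch {

  // Encode a sorted docID list as VInt-encoded deltas, as Lucene does on disk.
  static byte[] writePostings(int[] docIDs) {
    List<Byte> out = new ArrayList<>();
    int last = 0;
    for (int docID : docIDs) {
      int delta = docID - last;
      last = docID;
      while ((delta & ~0x7F) != 0) {      // more than 7 bits left:
        out.add((byte) ((delta & 0x7F) | 0x80));  // emit low 7, set high bit
        delta >>>= 7;
      }
      out.add((byte) delta);              // final byte, high bit clear
    }
    byte[] bytes = new byte[out.size()];
    for (int i = 0; i < bytes.length; i++) bytes[i] = out.get(i);
    return bytes;
  }

  // A TermDocs-like cursor over the buffered bytes.
  static class BufferedTermDocs {
    private final byte[] buf;
    private int pos;
    private int doc;

    BufferedTermDocs(byte[] buf) { this.buf = buf; }

    // Advance to the next docID; false when the buffer is exhausted.
    boolean next() {
      if (pos >= buf.length) return false;
      int delta = 0, shift = 0;
      byte b;
      do {
        b = buf[pos++];
        delta |= (b & 0x7F) << shift;
        shift += 7;
      } while ((b & 0x80) != 0);
      doc += delta;
      return true;
    }

    int doc() { return doc; }
  }

  static int[] readAll(byte[] buf) {
    List<Integer> docs = new ArrayList<>();
    BufferedTermDocs td = new BufferedTermDocs(buf);
    while (td.next()) docs.add(td.doc());
    return docs.stream().mapToInt(Integer::intValue).toArray();
  }
}
```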
Hmm, buffered delete terms are tricky. I guess freezeDocIDs would
have to flush the buffered delete terms (and queries, if we add those)
before making a reader accessible,
If we buffer queries, that would seem to take care of 99% of the
use cases that need an IndexReader, right? A custom query could get
ids from an index however it wanted.
I think so?
So, if we add only buffered "deleteByQuery" (and setNorm) to
IndexWriter, is that enough to deprecate deleteDocument, setNorm in
IndexReader?
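A hedged toy model of what buffered deleteByQuery on the writer might look like (not the real API; a "query" here is just a predicate over docIDs, and BufferingWriter is an invented name): nothing is resolved at call time, and all buffered queries are run against the index only at flush:

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical sketch: a writer that buffers deleteByQuery calls and
// resolves them to deleted docIDs only when it flushes.
class BufferingWriter {
  private final List<IntPredicate> bufferedDeleteQueries = new ArrayList<>();
  private int maxDoc;  // docs indexed so far

  void addDocument() { maxDoc++; }

  // Buffered: nothing is resolved until flush.
  void deleteByQuery(IntPredicate query) { bufferedDeleteQueries.add(query); }

  // At flush, run each buffered query and mark matching docs deleted.
  BitSet flushDeletes() {
    BitSet deleted = new BitSet(maxDoc);
    for (IntPredicate q : bufferedDeleteQueries)
      for (int doc = 0; doc < maxDoc; doc++)
        if (q.test(doc)) deleted.set(doc);
    bufferedDeleteQueries.clear();
    return deleted;
  }
}
```

Since a custom query can resolve docIDs however it likes, delete-by-docID falls out as the special case `doc -> doc == target`.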
though, the cost is shared because the
readers need to be opened anyway (so the app can find docIDs).
So maybe this approach becomes this:
// Returns a "point in time" frozen view of index...
IndexReader reader = writer.getReader();
try {
  // get docIDs from reader, delete by docID
} finally {
  writer.releaseReader();
}
?
We may even be able to implement this w/o actually freezing the
writer, i.e., still allowing add/updateDocument calls to proceed.
Merging could certainly still proceed. This way you could at any
time ask a writer for a "point in time" reader, independent of
whatever else you are doing with the writer. This would require that,
on flushing, the writer go and swap in a "real" segment reader,
limited to a specified docID, for any point-in-time readers that are
open.
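A minimal sketch of those "point in time" semantics (hypothetical names; the real work of swapping a flushed segment reader in underneath is elided): the reader pins maxDoc at open time, so adds made after getReader() stay invisible to it:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of writer.getReader(): the reader is truncated to
// the doc count recorded when it was opened.
class PointInTimeSketch {
  static class Writer {
    private final List<String> docs = new ArrayList<>();

    void addDocument(String doc) { docs.add(doc); }

    // "Point in time" view: remembers maxDoc now; later adds are invisible.
    Reader getReader() { return new Reader(docs, docs.size()); }
  }

  static class Reader {
    private final List<String> docs;
    private final int maxDoc;  // frozen at open time

    Reader(List<String> docs, int maxDoc) { this.docs = docs; this.maxDoc = maxDoc; }

    int maxDoc() { return maxDoc; }

    String document(int docID) {
      if (docID >= maxDoc) throw new IllegalArgumentException("beyond snapshot");
      return docs.get(docID);
    }
  }
}
```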
Wow... sounds complex.
I think it may not be so bad ... the raw ingredients are already done
(like ByteSliceReader) ... need to ponder it some more.
I think one very powerful side effect of doing this would be that you
could have extremely low latency indexing ("highly interactive
indexing"). You would add/delete docs using the writer, then quickly
re-open the reader, and be able to search the buffered docs without
the cost of flushing a new segment, assuming it's all within one JVM.
This reader (that searches both on-disk segments and the writer's
buffered docs) would do reopen extremely efficiently. In the
[distant?] future, it could even do searching "live", meaning the
full buffer is always searched rather than a point-in-time snapshot.
But we couldn't really do this until we re-work the FieldCache API to
belong to each segment & be incrementally updateable such that if a
new doc is added to the writer, we could efficiently update the
FieldCache, if present. That would be a big change :)
Lots to think through ....
If we went that route, we'd need to expose methods in IndexWriter to
let you get reader(s), and then to delete by docID.
Right... I had envisioned a callback that was called after a new
segment was created/flushed that passed IndexReader[]. In an
environment of mixed deletes and adds, it would avoid slowing down
the indexing part by limiting where the deletes happen.
This would certainly be less work :) I guess the question is how
severely we are limiting the application by requiring that you can
only do deletes when IW decides to flush, or by forcing the
application to flush when it wants to do deletes.
Seems like more work, rather than limiting... "when" really isn't as
important as long as it's before a new external IndexReader is opened
for searching.
Right, but if you want very low latency indexing (or even essentially
0) then you can't really afford to buffer deletes (or adds) for that
long...
It does put a little more burden on the user, but a slightly harder
(but more powerful / more efficient) API is preferable since easier
APIs can always be built on top (but not vice-versa).
True, though emulating the easier API on top of the "you get to
delete only when IW flushes" means you are forcing a flush, right?
I was thinking via buffering (the same way term deletes are handled
now).
You keep track of maxDoc() at the time of the delete and defer it
until later.
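That buffering scheme can be sketched like so (illustrative names, not Lucene code): each deferred delete remembers maxDoc() at the moment it was requested, and at apply time it only touches docIDs below that limit, so docs added afterward are never deleted even if they match:

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical sketch of deferring deletes by recording maxDoc() at
// delete time, the same way buffered delete terms work now.
class DeferredDeletes {
  // A buffered delete: the query plus the doc count when it arrived.
  static class PendingDelete {
    final IntPredicate query;
    final int maxDocAtDelete;
    PendingDelete(IntPredicate query, int maxDocAtDelete) {
      this.query = query;
      this.maxDocAtDelete = maxDocAtDelete;
    }
  }

  private final List<PendingDelete> pending = new ArrayList<>();
  private int maxDoc;

  void addDocument() { maxDoc++; }

  void deleteByQuery(IntPredicate query) {
    pending.add(new PendingDelete(query, maxDoc));  // defer until flush
  }

  BitSet applyDeletes() {
    BitSet deleted = new BitSet(maxDoc);
    for (PendingDelete d : pending)
      for (int doc = 0; doc < d.maxDocAtDelete; doc++)  // honor the recorded limit
        if (d.query.test(doc)) deleted.set(doc);
    pending.clear();
    return deleted;
  }
}
```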
Oh, right, OK.
Mike