I'm interested in merging cached bitsets and field caches.  While this may
be related to LUCENE-831, Bobo has custom field caches which we want to
merge in RAM (rather than reload from the reader using TermEnum +
TermDocs).  This could also tie into delete by doc id.
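
Roughly the kind of merge I mean, sketched for a per-segment int[]
field cache (hypothetical code; it assumes we can get at the merge's
source readers and their deletions):

  // Hypothetical: build the merged segment's cache from the source
  // segments' caches, skipping docs the merge removed, instead of
  // re-reading values via TermEnum + TermDocs.
  int[] mergeCaches(int[][] sourceCaches, SegmentReader[] sources,
                    int mergedMaxDoc) {
    int[] merged = new int[mergedMaxDoc];
    int out = 0;
    for (int i = 0; i < sources.length; i++) {
      for (int doc = 0; doc < sourceCaches[i].length; doc++) {
        if (!sources[i].isDeleted(doc)) {
          merged[out++] = sourceCaches[i][doc];
        }
      }
    }
    return merged;
  }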

Could tracking the genealogy of segments be provided as a callback from
IndexWriter?  Or could we add a method to IndexCommit or SegmentReader
that returns the segments it originated from?
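
Something like this, perhaps (purely hypothetical API; nothing like it
exists in IndexWriter today):

  public interface MergeListener {
    // sourceSegments were merged into mergedSegment; docBases[i] is
    // the first docID in mergedSegment contributed by sourceSegments[i].
    void onMergeCommit(String[] sourceSegments, String mergedSegment,
                       int[] docBases);
  }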

On Thu, Apr 2, 2009 at 1:40 AM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> On Wed, Apr 1, 2009 at 7:05 PM, Jason Rutherglen
> <jason.rutherg...@gmail.com> wrote:
> > Now that LUCENE-1516 is close to being committed perhaps we can
> > figure out the priority of other issues:
> >
> > 1. Searchable IndexWriter RAM buffer
>
> I think first priority is to get a good assessment of the performance
> of the current implementation (from LUCENE-1516).
>
> My initial tests are very promising: with a writer updating (replacing
> random docs) at 50 docs/second on a full (3.2 M) Wikipedia index, I
> was able to reopen the reader once per second and do a large (>
> 500K results) search that sorts by date.  The reopen time was
> typically ~40 msec, and search time typically ~35 msec (though there
> were random spikes up to ~340 msec).  Though, these results were on an
> SSD (Intel X25M 160 GB).
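>
> The test loop was basically this (a sketch; the query and field
> names are illustrative and error handling is omitted):
>
>   // writer is an open IndexWriter; another thread replaces ~50
>   // docs/sec via writer.updateDocument(...).
>   IndexReader reader = writer.getReader();      // NRT reader (LUCENE-1516)
>   while (running) {
>     IndexReader newReader = writer.getReader(); // picks up recent updates
>     if (newReader != reader) {
>       reader.close();
>       reader = newReader;
>     }
>     IndexSearcher searcher = new IndexSearcher(reader);
>     searcher.search(query, null, 10,            // query matches >500K docs
>         new Sort(new SortField("date", SortField.LONG)));
>     Thread.sleep(1000);                         // reopen ~once per second
>   }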
>
> We need more datapoints on the current approach, but this looks likely
> to be good enough for starters.  And since we can get it into 2.9,
> hopefully it'll get some early usage and people will report back to
> help us assess whether further performance improvements are necessary.
>
> If they do turn out to be necessary, I think before your step 1, we
> should write small segments into a RAMDirectory instead of the "real"
> directory.  That's simpler than truly searching IndexWriter's
> in-memory postings data.
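>
> The gist at the Directory level (illustrative only; the real change
> would live inside IndexWriter's flush path):
>
>   Directory main = FSDirectory.open(indexDir); // large, merged segments
>   Directory ram = new RAMDirectory();          // small, freshly flushed
>   // An NRT reader would then search both, e.g. a MultiReader over
>   // readers opened on each directory.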
>
> > 2. Finish up benchmarking and perhaps implement passing
> > filters to the SegmentReader level
>
> What is "passing filters to the SegmentReader level"?  EG as of
> LUCENE-1483, we now ask a Filter for its DocIdSet once per
> SegmentReader.
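>
> So a Filter can already cache per segment, roughly like
> CachingWrapperFilter does (sketch):
>
>   public class PerSegmentCachingFilter extends Filter {
>     private final Filter in;
>     private final Map<IndexReader,DocIdSet> cache =
>         new WeakHashMap<IndexReader,DocIdSet>();
>     public PerSegmentCachingFilter(Filter in) { this.in = in; }
>     public synchronized DocIdSet getDocIdSet(IndexReader reader)
>         throws IOException {
>       DocIdSet docs = cache.get(reader);  // reader is one segment
>       if (docs == null) {
>         docs = in.getDocIdSet(reader);
>         cache.put(reader, docs);
>       }
>       return docs;
>     }
>   }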
>
> > 3. Deleting by doc id using IndexWriter
>
> We need a clean approach for the "docIDs suddenly shift when merge is
> committed" problem for this...
>
> Thinking more on this... I think one possible solution may be to
> somehow expose IndexWriter's internal docID remapping code.
> IndexWriter does delete by docID internally, and whenever a merge is
> committed we stop-the-world (sync on IW) and go remap those docIDs.
> If we somehow allowed the user to register a callback that we could call
> when this remapping occurs, then user's code could carry the docIDs
> without them becoming stale.  Or maybe we could make a class
> "PendingDocIDs", which you'd ask the reader to give you, that holds
> docIDs and remaps them after each merge.  The problem is, IW
> internally always logically switches to the current reader for any
> further docID deletion, but the user's code may continue to use an old
> reader.  So simply exposing this remapping won't fix it... we'd need
> to somehow track the genealogy (quite a bit more complex).
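>
> Sketch of the PendingDocIDs idea (hypothetical; DocIDRemapper here
> stands in for IW's internal remapping code):
>
>   interface DocIDRemapper { int remap(int oldDocID); }
>
>   public class PendingDocIDs {
>     private final int[] docIDs;
>     public PendingDocIDs(int[] docIDs) { this.docIDs = docIDs; }
>     // IndexWriter would call this, under its internal sync,
>     // whenever a merge commits:
>     synchronized void remap(DocIDRemapper r) {
>       for (int i = 0; i < docIDs.length; i++) {
>         docIDs[i] = r.remap(docIDs[i]);
>       }
>     }
>     public synchronized int[] get() { return docIDs; }
>   }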
>
> > With 1) I'm interested in how we will lock a section of the
> > bytes for use by a given reader? We would not actually lock
> them, but we need to set aside the bytes so that, for example,
> if a postings list grows, TermDocs iteration does not progress
> beyond its limits. Are any modifications needed to the RAM
> buffer format? How would the term table be stored? We
> > would not be using the current hash method?
>
> I think the realtime reader'd just store the maxDocID it's allowed to
> search, and we would likely keep using the current RAM buffer format.
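>
> Ie something like this (sketch, not the actual implementation):
>
>   // The realtime reader snapshots maxDocID at open time; its
>   // TermDocs stops there even if the RAM buffer keeps growing.
>   public class BoundedTermDocs /* implements TermDocs */ {
>     private final TermDocs in;
>     private final int maxDocID; // snapshot when the reader was opened
>     public BoundedTermDocs(TermDocs in, int maxDocID) {
>       this.in = in;
>       this.maxDocID = maxDocID;
>     }
>     public boolean next() throws IOException {
>       return in.next() && in.doc() < maxDocID;
>     }
>     public int doc() { return in.doc(); }
>   }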
>
> Mike
>