I looked at the IndexWriter code with regard to creating a realtime reader.
With the many flexible indexing classes, I'm unsure how one would get a
frozen-ish IndexInput over the byte slices, given that the byte slices are
attached to different threads.

On Fri, Apr 3, 2009 at 2:42 PM, Jason Rutherglen <jason.rutherg...@gmail.com
> wrote:

> > I think the realtime reader'd just store the maxDocID it's allowed to
> search, and we would likely keep using the RAM format now used.
>
> Sounds pretty good.  Are there any other gotchas in the design?
>
>
>
> On Thu, Apr 2, 2009 at 1:40 AM, Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
>> On Wed, Apr 1, 2009 at 7:05 PM, Jason Rutherglen
>> <jason.rutherg...@gmail.com> wrote:
>> > Now that LUCENE-1516 is close to being committed perhaps we can
>> > figure out the priority of other issues:
>> >
>> > 1. Searchable IndexWriter RAM buffer
>>
>> I think first priority is to get a good assessment of the performance
>> of the current implementation (from LUCENE-1516).
>>
>> My initial tests are very promising: with a writer updating (replacing
>> random docs) at 50 docs/second on a full (3.2 M) Wikipedia index, I
>> was able to reopen the reader once per second and do a large (>
>> 500K results) search that sorts by date.  The reopen time was
>> typically ~40 msec, and search time typically ~35 msec (though there
>> were random spikes up to ~340 msec).  Though, these results were on an
>> SSD (Intel X25M 160 GB).
>>
>> We need more datapoints of the current approach, but this looks likely
>> to be good enough for starters.  And since we can get it into 2.9,
>> hopefully it'll get some early usage and people will report back to
>> help us assess whether further performance improvements are necessary.
>>
>> If they do turn out to be necessary, I think before your step 1, we
>> should write small segments into a RAMDirectory instead of the "real"
>> directory.  That's simpler than truly searching IndexWriter's
>> in-memory postings data.
>>
>> > 2. Finish up benchmarking and perhaps implement passing
>> > filters to the SegmentReader level
>>
>> What is "passing filters to the SegmentReader level"?  EG as of
>> LUCENE-1483, we now ask a Filter for its DocIdSet once per
>> SegmentReader.
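[Editor's note: a minimal, self-contained sketch of the per-segment idea
referenced above. The class and field names here are assumptions for
illustration, not Lucene's actual API; the point is that a filter's matches
are computed once per segment and mapped to global docIDs through each
segment's running docBase offset, as LUCENE-1483 does per SegmentReader.]

```java
import java.util.BitSet;
import java.util.List;

class SegmentFilterSketch {
    /** One segment: its size and the local docIDs the filter accepts. */
    static class Segment {
        final int maxDoc;
        final BitSet localMatches;
        Segment(int maxDoc, int... matchingLocalDocs) {
            this.maxDoc = maxDoc;
            this.localMatches = new BitSet(maxDoc);
            for (int d : matchingLocalDocs) localMatches.set(d);
        }
    }

    /** Collect global docIDs by asking each segment for its matches. */
    static BitSet globalMatches(List<Segment> segments) {
        BitSet result = new BitSet();
        int docBase = 0;                  // running offset of this segment
        for (Segment seg : segments) {
            for (int d = seg.localMatches.nextSetBit(0); d >= 0;
                 d = seg.localMatches.nextSetBit(d + 1)) {
                result.set(docBase + d);  // local -> global docID
            }
            docBase += seg.maxDoc;
        }
        return result;
    }
}
```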
>>
>> > 3. Deleting by doc id using IndexWriter
>>
>> We need a clean approach for the "docIDs suddenly shift when merge is
>> committed" problem for this...
>>
>> Thinking more on this... I think one possible solution may be to
>> somehow expose IndexWriter's internal docID remapping code.
>> IndexWriter does delete by docID internally, and whenever a merge is
>> committed we stop-the-world (sync on IW) and go remap those docIDs.
>> If we somehow allowed user to register a callback that we could call
>> when this remapping occurs, then user's code could carry the docIDs
>> without them becoming stale.  Or maybe we could make a class
>> "PendingDocIDs", which you'd ask the reader to give you, that holds
>> docIDs and remaps them after each merge.  The problem is, IW
>> internally always logically switches to the current reader for any
>> further docID deletion, but the user's code may continue to use an old
>> reader.  So simply exposing this remapping won't fix it... we'd need
>> to somehow track the genealogy (quite a bit more complex).
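[Editor's note: a minimal sketch of the "PendingDocIDs" idea proposed above:
a holder that carries docIDs across merges by applying the same remapping the
writer performs when a merge is committed. The DocMap callback is an
assumption for illustration; Lucene does not expose its internal remapping
this way, which is exactly the gap the paragraph above discusses.]

```java
import java.util.Arrays;

class PendingDocIDs {
    private int[] docIDs;

    PendingDocIDs(int... docIDs) { this.docIDs = docIDs.clone(); }

    /** Hypothetical callback: maps an old docID to its post-merge docID. */
    interface DocMap { int map(int oldDocID); }

    /** Called when a merge commits; stale IDs are rewritten in place. */
    void remap(DocMap map) {
        for (int i = 0; i < docIDs.length; i++)
            docIDs[i] = map.map(docIDs[i]);
    }

    int[] current() { return docIDs.clone(); }
}
```

For example, after a merge drops deleted docs, each surviving docID shifts
down by the number of deleted docs below it, and remap() keeps the held IDs
current.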
>>
>> > With 1) I'm interested in how we will lock a section of the
>> > bytes for use by a given reader? We would not actually lock
>> > them, but we need to set aside the bytes such that for example
>> > if the postings grows, TermDocs iteration does not progress to
>> > beyond its limits. Are there any modifications that are needed
>> > of the RAM buffer format? How would the term table be stored? We
>> > would not be using the current hash method?
>>
>> I think the realtime reader'd just store the maxDocID it's allowed to
>> search, and we would likely keep using the RAM format now used.
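[Editor's note: a minimal sketch of the maxDocID idea above. The class name
and postings representation are assumptions, not Lucene internals: the point
is that an iterator which snapshots an exclusive maxDocID at reader-open time
never advances past it, even if the underlying postings keep growing.]

```java
class CappedTermDocsSketch {
    private final int[] postings;  // sorted docIDs; may be appended to later
    private final int maxDocID;    // exclusive cap taken at reader-open time
    private int pos = -1;

    CappedTermDocsSketch(int[] postings, int maxDocID) {
        this.postings = postings;
        this.maxDocID = maxDocID;
    }

    /** Advance to the next doc below the cap; false when exhausted. */
    boolean next() {
        pos++;
        return pos < postings.length && postings[pos] < maxDocID;
    }

    int doc() { return postings[pos]; }
}
```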
>>
>> Mike
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>>
>>
>
