Michael: I love your suggestion on 3)! This really opens doors for flexible indexing.
-John

On Thu, Apr 2, 2009 at 1:40 AM, Michael McCandless
<luc...@mikemccandless.com> wrote:

> On Wed, Apr 1, 2009 at 7:05 PM, Jason Rutherglen
> <jason.rutherg...@gmail.com> wrote:
> > Now that LUCENE-1516 is close to being committed, perhaps we can
> > figure out the priority of other issues:
> >
> > 1. Searchable IndexWriter RAM buffer
>
> I think the first priority is to get a good assessment of the
> performance of the current implementation (from LUCENE-1516).
>
> My initial tests are very promising: with a writer updating (replacing
> random docs) at 50 docs/second on a full (3.2 M) Wikipedia index, I was
> able to reopen the reader once per second and do a large (> 500K
> results) search that sorts by date. The reopen time was typically ~40
> msec, and search time typically ~35 msec (though there were random
> spikes up to ~340 msec). Though, these results were on an SSD (Intel
> X25M 160 GB).
>
> We need more datapoints for the current approach, but this looks likely
> to be good enough for starters. And since we can get it into 2.9,
> hopefully it'll get some early usage and people will report back to
> help us assess whether further performance improvements are necessary.
>
> If they do turn out to be necessary, I think before your step 1 we
> should write small segments into a RAMDirectory instead of the "real"
> directory. That's simpler than truly searching IndexWriter's in-memory
> postings data.
>
> > 2. Finish up benchmarking and perhaps implement passing
> > filters to the SegmentReader level
>
> What is "passing filters to the SegmentReader level"? E.g., as of
> LUCENE-1483, we now ask a Filter for its DocIdSet once per
> SegmentReader.
>
> > 3. Deleting by doc id using IndexWriter
>
> We need a clean approach for the "docIDs suddenly shift when a merge is
> committed" problem for this...
>
> Thinking more on this... I think one possible solution may be to
> somehow expose IndexWriter's internal docID remapping code. IndexWriter
> does delete by docID internally, and whenever a merge is committed we
> stop the world (sync on IW) and go remap those docIDs. If we somehow
> allowed the user to register a callback that we could call when this
> remapping occurs, then the user's code could carry the docIDs without
> them becoming stale. Or maybe we could make a class, "PendingDocIDs",
> which you'd ask the reader to give you, that holds docIDs and remaps
> them after each merge. The problem is, IW internally always logically
> switches to the current reader for any further docID deletion, but the
> user's code may continue to use an old reader. So simply exposing this
> remapping won't fix it... we'd need to somehow track the genealogy
> (quite a bit more complex).
>
> > With 1) I'm interested in how we will lock a section of the
> > bytes for use by a given reader? We would not actually lock
> > them, but we need to set aside the bytes such that, for example,
> > if the postings grow, TermDocs iteration does not progress
> > beyond its limits. Are there any modifications that are needed
> > to the RAM buffer format? How would the term table be stored?
> > Would we not be using the current hash method?
>
> I think the realtime reader would just store the maxDocID it's allowed
> to search, and we would likely keep using the RAM format now used.
>
> Mike
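
To make the "small segments into a RAMDirectory" idea concrete, here is a
rough sketch against the 2.9-era APIs: new docs go to an IndexWriter over a
RAMDirectory, and searches run over a MultiReader that combines the on-disk
reader with the RAM reader. The path and field names below are made up, and
the flushing/merging of RAM segments back to disk is left out entirely:

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class RamBufferedSearch {
  public static void main(String[] args) throws Exception {
    Directory disk = FSDirectory.open(new File("/path/to/index"));  // made-up path
    Directory ram = new RAMDirectory();

    // New/updated docs go into small RAM segments; some background job
    // would periodically move them into the disk index (not shown).
    IndexWriter ramWriter = new IndexWriter(ram,
        new StandardAnalyzer(Version.LUCENE_29),
        IndexWriter.MaxFieldLength.UNLIMITED);
    Document doc = new Document();
    doc.add(new Field("id", "42", Field.Store.YES, Field.Index.NOT_ANALYZED));
    ramWriter.addDocument(doc);
    ramWriter.commit();

    // Reopen is cheap: the RAM reader is tiny, the disk reader rarely changes.
    IndexReader diskReader = IndexReader.open(disk, true);
    IndexReader ramReader = IndexReader.open(ram, true);
    IndexSearcher searcher = new IndexSearcher(
        new MultiReader(new IndexReader[] { diskReader, ramReader }));
    // ... run searches, then close searcher, readers, and writer ...
  }
}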
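
On 2), since LUCENE-1483 a Filter's getDocIdSet() is already called once per
SegmentReader, so a filter like the one below only ever sees segment-relative
docIDs and its bit sets stay small enough to cache per segment. This is just
an illustrative filter; the term it matches is arbitrary:

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.Filter;
import org.apache.lucene.util.OpenBitSet;

public class PerSegmentTermFilter extends Filter {
  private final Term term;

  public PerSegmentTermFilter(Term term) {
    this.term = term;
  }

  // Called once per SegmentReader during the search, so "reader" here is
  // a single segment and the docIDs we set are segment-relative.
  public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
    OpenBitSet bits = new OpenBitSet(reader.maxDoc());
    TermDocs td = reader.termDocs(term);
    try {
      while (td.next()) {
        bits.set(td.doc());
      }
    } finally {
      td.close();
    }
    return bits;
  }
}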
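
For 3), here is a very rough sketch of what a PendingDocIDs helper could look
like. Nothing like this exists today: the remap() callback and the old-to-new
docID map are pure assumptions about how IndexWriter's internal remapping
might be exposed.

import java.util.Arrays;

// Hypothetical holder for docIDs that survive merges: the (assumed)
// IndexWriter callback would invoke remap() each time a merge commits.
public class PendingDocIDs {
  private final int[] docIDs;

  public PendingDocIDs(int[] docIDs) {
    this.docIDs = docIDs.clone();
  }

  // oldToNew[i] = the post-merge docID for pre-merge docID i, or -1 if
  // that doc was dropped by the merge. This map is an assumption; it is
  // not something IndexWriter currently hands out.
  public synchronized void remap(int[] oldToNew) {
    for (int i = 0; i < docIDs.length; i++) {
      if (docIDs[i] >= 0) {
        docIDs[i] = oldToNew[docIDs[i]];
      }
    }
  }

  // Current view of the docIDs; -1 entries no longer exist.
  public synchronized int[] current() {
    return docIDs.clone();
  }

  public String toString() {
    return Arrays.toString(docIDs);
  }
}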
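
And on the last point, the "reader stores the maxDocID it's allowed to
search" idea boils down to capping postings iteration at a snapshot boundary.
A toy wrapper to show the intent (invented class, not part of Lucene; a real
implementation would live inside the reader itself):

import java.io.IOException;

import org.apache.lucene.index.TermDocs;

// Toy wrapper: stops the enumeration once it crosses the docID limit the
// reader snapshotted at open time, so postings the writer appends later
// are never surfaced to this reader.
public class BoundedTermDocs {
  private final TermDocs in;
  private final int maxDocExclusive;  // snapshot taken when the reader opened

  public BoundedTermDocs(TermDocs in, int maxDocExclusive) {
    this.in = in;
    this.maxDocExclusive = maxDocExclusive;
  }

  public boolean next() throws IOException {
    // docIDs come back in increasing order, so the first doc at or past
    // the limit ends the enumeration.
    return in.next() && in.doc() < maxDocExclusive;
  }

  public int doc() {
    return in.doc();
  }

  public int freq() {
    return in.freq();
  }

  public void close() throws IOException {
    in.close();
  }
}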