On Mon, Apr 6, 2009 at 6:43 PM, Jason Rutherglen
<jason.rutherg...@gmail.com> wrote:
>> The realtime reader would have to have sub-readers per thread,
>> and an aggregate reader that "joins" them by interleaving the
>> docIDs
>
> Nice (i.e. nice and complex)!

Right, this is why I like the current [simple] near real-time
approach.  I think we should keep it simple, unless we discover real
perf problems with it.

> Not knowing too much about the
> internals, how would the interleaving work? Does each subreader
> have a "start" ala Multi*Reader? Or are the doc ids incremented
> from a synced place such that no two readers have the same doc
> id?

The docIDs must be woven/interleaved together (unlike MultiReader,
where they are concatenated).  DW ensures that a given docID is used
by only one thread, so you'd need to do a merge sort (across the N
thread states) when reading the postings for a given term.  We'd then
probably suggest using a single thread for indexing when NRT search
will be used, for best search performance.
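
To make that concrete, here's a rough sketch of the merge sort (all
names are hypothetical, not Lucene APIs): a priority queue keyed on
docID pulls the smallest current docID across the N per-thread
cursors, and since DW guarantees the docIDs are disjoint, no
tie-breaking is needed:

  import java.util.Comparator;
  import java.util.PriorityQueue;

  /** Hypothetical cursor over one thread state's postings for a term. */
  interface ThreadPostings {
    int doc();       // current docID
    boolean next();  // advance to the next doc; false when exhausted
  }

  /** Merge-sorts N per-thread postings lists into global docID order. */
  class InterleavedTermDocs {
    private final PriorityQueue<ThreadPostings> queue;

    InterleavedTermDocs(ThreadPostings[] perThread) {
      queue = new PriorityQueue<ThreadPostings>(
          Math.max(1, perThread.length),
          new Comparator<ThreadPostings>() {
            public int compare(ThreadPostings a, ThreadPostings b) {
              return a.doc() - b.doc();  // docIDs are non-negative
            }
          });
      for (ThreadPostings p : perThread) {
        if (p.next()) {  // prime each cursor before queueing it
          queue.add(p);
        }
      }
    }

    /** Returns the next docID in ascending order, or -1 when done. */
    int nextDoc() {
      ThreadPostings top = queue.poll();
      if (top == null) {
        return -1;
      }
      int doc = top.doc();
      if (top.next()) {  // advance, then re-queue under the new docID
        queue.add(top);
      }
      return doc;
    }
  }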

>> BTW there are benefits to not reusing the RAM buffer, outside
>> of faster near real-time search
>
> Not reusing the RAM buffer means not reusing the pooled byte
> arrays after a flush or something else?

Pooled byte, char and int arrays, *PerThread, *PerField classes, norms, etc.
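
Ie, something like this simplified sketch (not DocumentsWriter's
actual code): "reuse" vs "not reuse" is the difference between
recycling the blocks after a flush and dropping them for GC:

  import java.util.ArrayDeque;

  /** Simplified sketch of a recycling byte-block pool. */
  class ByteBlockPool {
    static final int BLOCK_SIZE = 32 * 1024;
    private final ArrayDeque<byte[]> free = new ArrayDeque<byte[]>();

    byte[] alloc() {
      byte[] b = free.poll();
      return b != null ? b : new byte[BLOCK_SIZE];  // reuse if possible
    }

    // After a flush: recycle the blocks (keeps the RAM tied up)...
    void recycle(byte[] block) {
      free.push(block);
    }

    // ...or drop them all, letting GC reclaim the RAM right away.
    void dropAll() {
      free.clear();
    }
  }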

> SSDs are cool, I can't see management approving of those quite
> yet, are there many places piloting Lucene on SSDs that you're
> aware of?

Yes, they are still somewhat expensive, though the productivity gain
is sizable, and prices have been coming down...

EG I have a zillion Lucene source code checkouts, and it used to be
that whenever I switched back to one and did an "svn up" or "svn
diff", it was a good 30 seconds of disk heads grinding away before
anything really happened.  Now it's a second or two.  VMware/Parallels
also become much more responsive.  Not hearing the disk heads grinding
is somewhat disconcerting at first, though.

Several people on java-user have already posted SSD benchmarks.

SSDs are clearly the future, and I think we need to think more about
what their adoption means for how we prioritize Lucene's ongoing
improvements.  EG I think it means the CPU cost of searching and
single-search concurrency (using more than one thread for one search)
become more important, because once your index is on an SSD, Lucene
will spend far less time waiting for IO to complete, even on "normal"
installations that don't cache the entire index in RAM.  I think we
especially need to figure out how to leverage concurrency in the IO
system (though alas we don't have an async IO API in Java... we'd
have to "emulate" it using threads).
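
Eg, a rough sketch of that emulation (names hypothetical): a thread
pool fronts blocking positional reads, so several IO requests can be
in flight at once, which is exactly where SSDs shine:

  import java.io.IOException;
  import java.io.RandomAccessFile;
  import java.util.concurrent.Callable;
  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;
  import java.util.concurrent.Future;

  /** Emulated async IO: each positional read runs on a pool thread. */
  class EmulatedAsyncIO {
    private final ExecutorService pool = Executors.newFixedThreadPool(8);

    /** Issues the read and returns immediately; get() blocks for bytes. */
    Future<byte[]> read(final String path, final long offset,
                        final int length) {
      return pool.submit(new Callable<byte[]>() {
        public byte[] call() throws IOException {
          RandomAccessFile file = new RandomAccessFile(path, "r");
          try {
            byte[] buf = new byte[length];
            file.seek(offset);
            file.readFully(buf);
            return buf;
          } finally {
            file.close();
          }
        }
      });
    }
  }

A searcher could then issue the reads for several postings blocks up
front and only block when it actually needs the bytes.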

> From what you've said so far, this is how I understand realtime
> ram buffer readers could work:
>
> There'd be an IndexWriter.getRAMReader method that gathers all
> the ram buffers from the various threads, marks a doc id as the
> last one for the overall RAMBufferMultiReader. A new set of
> classes, RAMBufferTermEnum, RAMBufferTermDocs,
> RAMBufferTermPositions would be implemented that can read from
> the ram buffer.

Right, but we shouldn't start work on this until we see a reason to.
And even once that reason appears, we should first do the intermediate
optimization of using a RAMDir for newly flushed segments.
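
Ie, roughly this sketch: RAMDirectory and MultiReader are real
classes, but the plumbing that would have IndexWriter flush new
segments into the RAMDirectory is hypothetical:

  import java.io.IOException;

  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.MultiReader;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.RAMDirectory;

  /** Sketch: just-flushed segments served from RAM, the rest from disk. */
  class NearRealTimeView {
    private final Directory stable;  // long-lived segments on disk
    private final RAMDirectory fresh = new RAMDirectory();  // new flushes

    NearRealTimeView(Directory stable) {
      this.stable = stable;
    }

    IndexReader open() throws IOException {
      // Concatenate the two, ala Multi*Reader: docIDs in "fresh"
      // start where "stable" leaves off.
      return new MultiReader(new IndexReader[] {
          IndexReader.open(stable),
          IndexReader.open(fresh)
      });
    }
  }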

> I don't think the current field cache API would like growing
> arrays? Something hopefully LUCENE-831 will support.

I'm thinking for LUCENE-831 we should make the field cache
segment-centric, which would then play well w/ NRT.
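
Ie, something like this sketch: cache arrays get loaded per segment,
so a reopen only pays for the new segments while unchanged segments
keep their arrays, and no growing arrays are needed.  (The
getSequentialSubReaders accessor and the warming loop are assumptions
here, not settled API.)

  import java.io.IOException;

  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.search.FieldCache;

  /** Sketch: populate the field cache per segment, not per
      top-level reader. */
  class SegmentCentricWarming {
    void warm(IndexReader topReader, String field) throws IOException {
      // Assumed accessor exposing the per-segment readers.
      for (IndexReader segment : topReader.getSequentialSubReaders()) {
        // One array per segment, sized to that segment's maxDoc().
        int[] values = FieldCache.DEFAULT.getInts(segment, field);
        assert values.length == segment.maxDoc();
      }
    }
  }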

Mike
