On Mon, Apr 6, 2009 at 6:43 PM, Jason Rutherglen <jason.rutherg...@gmail.com> wrote:
>> The realtime reader would have to have sub-readers per thread,
>> and an aggregate reader that "joins" them by interleaving the
>> docIDs
>
> Nice (i.e. nice and complex)!

Right, this is why I like the current [simple] near real-time approach. I think we should keep it simple unless we discover real perf problems with the current approach.

> Not knowing too much about the internals, how would the
> interleaving work? Does each subreader have a "start" ala
> Multi*Reader? Or are the doc ids incremented from a synced
> place such that no two readers have the same doc id?

The docIDs must be woven/interleaved together (unlike MultiReader, where they are concatenated). DW ensures that a given docID is used by only one thread, so you'd need to do a merge sort (across the N thread states) when reading the postings for a given term. We'd then probably suggest using a single thread for indexing when NRT search will be used, for best search performance.
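Concretely, the merge would look something like this (just a sketch -- ThreadStatePostings is a made-up stand-in for whatever per-thread postings iterator DW would expose, not a real class). Since each thread state produces its docIDs in ascending order, and no docID lives in two thread states, a small priority queue keyed on docID weaves them into one stream:

  // Hypothetical sketch: ThreadStatePostings stands in for a
  // per-thread postings iterator; it is not a real Lucene class.
  // Each thread state yields its docIDs in ascending order, and
  // DW guarantees no docID is shared across threads, so a plain
  // merge by docID interleaves them correctly.

  import java.util.Comparator;
  import java.util.PriorityQueue;

  interface ThreadStatePostings {
    boolean next();  // advance to next doc; false when exhausted
    int doc();       // current docID (ascending within this thread state)
    int freq();      // term freq in the current doc
  }

  class InterleavedPostings {
    private final PriorityQueue<ThreadStatePostings> pq;
    private ThreadStatePostings current;

    InterleavedPostings(ThreadStatePostings[] threadStates) {
      pq = new PriorityQueue<ThreadStatePostings>(
          Math.max(1, threadStates.length),
          new Comparator<ThreadStatePostings>() {
            public int compare(ThreadStatePostings a, ThreadStatePostings b) {
              return a.doc() - b.doc();
            }
          });
      for (ThreadStatePostings ts : threadStates) {
        if (ts.next()) {   // prime each sub-iterator
          pq.add(ts);
        }
      }
    }

    // Advance to the globally next docID across all thread states.
    boolean next() {
      if (current != null && current.next()) {
        pq.add(current);   // re-insert under its new docID
      }
      current = pq.poll();
      return current != null;
    }

    int doc()  { return current.doc(); }
    int freq() { return current.freq(); }
  }

The cost is an O(log N) queue operation per doc, which is exactly why a single indexing thread would be the fast path here.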
>> BTW there are benefits to not reusing the RAM buffer, outside
>> of faster near real-time search
>
> Not reusing the RAM buffer means not reusing the pooled byte
> arrays after a flush or something else?

Pooled byte, char and int arrays, *PerThread, *PerField classes, norms, etc.

> SSDs are cool, I can't see management approving of those quite
> yet, are there many places piloting Lucene on SSDs that you're
> aware of?

Yes, they are still somewhat expensive, though the gain in productivity is sizable, and prices have been coming down... EG I have a zillion Lucene source code checkouts, and it used to be that whenever I switched back to one and did an "svn up" or "svn diff", there was a good 30 seconds of disk heads grinding away before anything really happened. Now it's a second or two. VMware/Parallels also become much more responsive. Not hearing disk heads grinding is somewhat disconcerting at first, though.

At least several people on java-user have posted benchmarks with SSDs.

SSDs are clearly the future, and I think we need to think more about what their adoption means for how we prioritize Lucene's ongoing improvements. EG I think it means the CPU cost of searching, and single-search concurrency (using more than one thread on one search), become important, because once your index is on an SSD, Lucene will spend far less time waiting for IO to complete, even on "normal" installations that don't cache the entire index in RAM. I think we especially need to figure out how to leverage concurrency in the IO system (but alas we don't have an async IO API from Java... we'd have to "emulate" it using threads).

> From what you've said so far, this is how I understand realtime
> ram buffer readers could work:
>
> There'd be an IndexWriter.getRAMReader method that gathers all
> the ram buffers from the various threads, and marks a doc id as
> the last one for the overall RAMBufferMultiReader. A new set of
> classes, RAMBufferTermEnum, RAMBufferTermDocs,
> RAMBufferTermPositions would be implemented that can read from
> the ram buffer.

Right, but we shouldn't start work on this until we see a reason to. And even once that reason appears, we should next do the intermediate optimization of using a RAMDir for newly flushed segments.

> I don't think the current field cache API would like growing
> arrays?

Hopefully that's something LUCENE-831 will support. I'm thinking for LUCENE-831 we should make the field cache segment centric, which would then play well w/ NRT.

Mike
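PS: to make "segment centric" a bit more concrete, below is roughly what I'm picturing (only a sketch, not the actual LUCENE-831 patch -- the class and method names are invented, and it assumes the field's terms are numeric strings). The cache is keyed on each segment's reader rather than the top-level reader, so after an NRT reopen only the newly flushed [tiny] segments need to load values; everything else hits the existing entries:

  // Sketch only (not the LUCENE-831 patch; names are made up):
  // values are cached per segment reader, so an NRT reopen only
  // pays to uninvert the newly flushed segments -- unchanged
  // segments reuse their existing arrays.

  import java.io.IOException;
  import java.util.HashMap;
  import java.util.Map;
  import java.util.WeakHashMap;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.index.TermDocs;
  import org.apache.lucene.index.TermEnum;

  class SegmentCentricFieldCache {

    // Weak keys: a segment's entry goes away once its reader is GC'd.
    private final Map<IndexReader,Map<String,int[]>> cache =
        new WeakHashMap<IndexReader,Map<String,int[]>>();

    // Get (loading only if absent) the int values for one segment.
    public synchronized int[] getInts(IndexReader segmentReader, String field)
        throws IOException {
      Map<String,int[]> perSegment = cache.get(segmentReader);
      if (perSegment == null) {
        perSegment = new HashMap<String,int[]>();
        cache.put(segmentReader, perSegment);
      }
      int[] values = perSegment.get(field);
      if (values == null) {
        values = loadInts(segmentReader, field);  // this segment only
        perSegment.put(field, values);
      }
      return values;
    }

    // Uninvert one segment: assumes the field's terms parse as ints.
    private int[] loadInts(IndexReader reader, String field)
        throws IOException {
      final int[] values = new int[reader.maxDoc()];
      final TermEnum terms = reader.terms(new Term(field, ""));
      final TermDocs docs = reader.termDocs();
      try {
        do {
          final Term t = terms.term();
          if (t == null || !t.field().equals(field)) {
            break;                       // ran off the end of this field
          }
          final int value = Integer.parseInt(t.text());
          docs.seek(terms);
          while (docs.next()) {
            values[docs.doc()] = value;  // docID is segment-local
          }
        } while (terms.next());
      } finally {
        docs.close();
        terms.close();
      }
      return values;
    }
  }

The searcher would then ask each sub-reader's cache for its [segment-local] array, offset by that segment's docBase, instead of rebuilding one giant top-level array on every reopen.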