Cool! Sounds like with LUCENE-1458 we can experiment with some of these things. Does CSF become just another codec?
> I'm leery of having terms dict live entirely on disk, though we
> should certainly explore it.

Yeah, it should theoretically help with reloading, and it could use a
skip list (we already have a disk version of that implemented) instead
of binary search. With things like TrieRange (which potentially adds
many fields and terms) it could be useful to let the IO cache work out
what we need in RAM and what we don't; otherwise we're constantly at
risk of exceeding heap usage. There'll be other potential RAM issues
(such as page faults), but it seems like users will constantly be up
against the inability to precalculate the Java heap usage of data
structures (whereas file-based data usage can be measured). Norms are
another example, and with flexible indexing (and scoring?) there may be
additional fields the user wants to change dynamically that, if
completely loaded into heap, cause OOM problems.

I guess I personally think it would be great not to worry about
exceeding heap with Lucene apps (as it's a guessing game); one could
then simply analyze the OS-level IO cache/swap space to see whether the
app could slow down because the machine doesn't have enough RAM. I
think this would remove one of the major differences between a
Java-based search engine and a C++-based one.

On Wed, Jun 10, 2009 at 1:26 PM, Michael McCandless <luc...@mikemccandless.com> wrote:

> On Wed, Jun 10, 2009 at 4:13 PM, Jason
> Rutherglen <jason.rutherg...@gmail.com> wrote:
> > Great! If I understand correctly it looks like RAM savings? Will
> > there be an improvement in lookup speed? (We're using binary
> > search here?)
>
> Yes, sizable RAM reduction for apps that have many unique terms. And,
> init'ing (warming) the reader should be faster.
>
> Lookup speed should be faster (binary search against the terms in a
> single field, not all terms).
>
> > Is there a precedent in database systems for what was mentioned
> > about placing the term dict, delDocs, and filters onto disk and
> > reading them from there (with the IO cache taking care of
> > keeping the data in RAM)? (Would there be a future advantage to
> > this approach when SSDs are more prevalent?) It seems like we
> > could have some generalized pluggable system where one could try
> > out this or the current heap approach, and benchmark.
>
> LUCENE-1458 creates exactly such a pluggable system. Ie it lets you
> swap in your own codec for terms, freq, prox, etc.
>
> But: I'm leery of having terms dict live entirely on disk, though we
> should certainly explore it.
>
> > Given our continued inability to properly measure Java RAM
> > usage, this approach may be a good one for Lucene? Where heap
> > based LRU caches are a shot in the dark when it comes to mem
> > size, as we never really know how much they're using.
>
> Well remember mmap uses an LRU policy to decide when pages are swapped
> to disk... so a search that's unlucky can easily hit many page faults
> just in consulting the terms dict. You could be at 200 msec cost
> before you even hit a postings list... I prefer to have the terms
> index RAM resident (of course the OS can still swap THAT out too...).
>
> > Once we generalize delDocs, filters, and field caches
> > (LUCENE-831?), then perhaps CSF is a good place to test out this
> > approach? We could have a generic class that handles the
> > underlying IO that simply returns values based on a position or
> > iteration.
>
> I agree, a CSF codec that uses mmap seems like a good place to
> start...
>
> Mike
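
For illustration, the "generic class that handles the underlying IO and simply returns values based on a position" idea quoted above might look roughly like the sketch below. It assumes one fixed-width 8-byte value per document, stored in docID order in a flat file. The class name MappedColumn and its methods are hypothetical and are not part of Lucene or of the LUCENE-1458 patch; only standard java.nio calls are used.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical fixed-width column reader backed by mmap; not a Lucene API. */
final class MappedColumn {
    private final MappedByteBuffer buf;
    private final int size;

    private MappedColumn(MappedByteBuffer buf) {
        this.buf = buf;
        this.size = buf.capacity() / Long.BYTES;
    }

    /** Writes one 8-byte big-endian value per document, in docID order. */
    static void write(Path file, long[] values) throws IOException {
        ByteBuffer out = ByteBuffer.allocate(values.length * Long.BYTES);
        for (long v : values) {
            out.putLong(v);
        }
        Files.write(file, out.array());
    }

    /**
     * Maps the whole file read-only. The mapping stays valid after the
     * channel is closed. A single mapping cannot exceed Integer.MAX_VALUE
     * bytes, so a real implementation would need several mappings.
     */
    static MappedColumn open(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            return new MappedColumn(ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size()));
        }
    }

    /** Positional lookup: the value lives at byte offset docId * 8. */
    long get(int docId) {
        if (docId < 0 || docId >= size) {
            throw new IndexOutOfBoundsException("docId " + docId);
        }
        return buf.getLong(docId * Long.BYTES);
    }

    int size() {
        return size;
    }
}
```

With this layout nothing is copied into the Java heap: the OS page cache decides which parts of the file stay in RAM, which is the trade-off discussed in this thread (measurable file-based footprint on one side, possible page faults on an unlucky lookup on the other).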