"Marvin Humphrey" <[EMAIL PROTECTED]> wrote:

> > (I think for KS you "add" a previous segment not that
> > differently from how you "add" a document)?
> 
> Yeah.  KS has to decompress and serialize posting content, which sux.
> 
> The one saving grace is that with the Fibonacci merge schedule and  
> the seg-at-a-time indexing strategy, segments don't get merged nearly  
> as often as they do in Lucene.

Yeah we need to work on this one.  One thing that irks me about the
current Lucene merge policy (besides that it gets confused when you
flush-by-RAM-usage) is that it's a "pay it forward" design so you're
always over-paying when you build a given index size.  With KS's
Fibonacci merge policy, you don't.  LUCENE-854 has some more details.
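
Just to make the contrast concrete, here's a toy Perl sketch of the
Fibonacci-style idea as I understand it.  This is not KS's actual
code, and plan_merge() is a name I made up: the tail of small
segments only gets rewritten once it collectively adds up to the
next larger segment, instead of every flush paying merge cost up
front.

    use strict;
    use warnings;

    # Toy sketch of a Fibonacci-style merge schedule (made-up names,
    # not KS internals).  Segments are sorted largest-first; we only
    # merge the run of small segments once their combined size
    # reaches the next larger segment's size.
    sub plan_merge {
        my @seg_sizes = sort { $b <=> $a } @_;
        my $tail_total = 0;
        for ( my $i = $#seg_sizes; $i > 0; $i-- ) {
            $tail_total += $seg_sizes[$i];
            if ( $tail_total >= $seg_sizes[ $i - 1 ] ) {
                return $i - 1;    # merge segments $i-1 .. end
            }
        }
        return;                   # nothing worth merging yet
    }

    my @sizes = ( 100, 60, 30, 20, 10 );    # docs per segment
    my $start = plan_merge(@sizes);
    if ( defined $start ) {
        print "merge segments $start through $#sizes\n";
    }
    else {
        print "no merge needed\n";
    }

That's the sense in which you stop over-paying for a given final
index size: a run of small flushes doesn't force the big segments to
be rewritten until the sizes say it's worth it.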

Segment merging really is costly.  In building a large (86 GB, 10 MM
docs) index, 65.6% of the time was spent merging!  Details are in
LUCENE-856...

> > On C) I think it is important so the many ports of Lucene can "compare
> > notes" and "cross fertilize".
> 
> Well, if you port Lucene's benchmarking stuff to Perl/C, I'll apply  
> the patch. ;)

I hear you!

> Cross-fertilization is a powerful tool for stimulating algorithmic  
> innovation.  Exhibit A: our unfolding collaborative successes.

Couldn't agree more.

> That's why it was built into the Lucy proposal:
> 
>      [Lucy's C engine] will provide core, performance-critical
>      functionality, but leave as much up to the higher-level
>      language as possible.
> 
> Users from diverse communities approach problems from different  
> angles and come up with different solutions.  The best ones will  
> propagate across Lucy bindings.
> 
> The only problem is that since Dave Balmain has been much less  
> available than we expected, it's been largely up to me to get Lucy to  
> critical mass where other people can start writing bindings.

This is a great model.  Are there Python bindings to Lucy yet/coming?

> > But does KS give its users a choice in Tokenizer?
> 
> You supply a regular expression which matches one token.
> 
>    # Presto! A WhiteSpaceTokenizer:
>    my $tokenizer = KinoSearch::Analysis::Tokenizer->new(
>        token_re => qr/\S+/
>    );
> 
> > Or, can users pre-tokenize their fields themselves?
> 
> TokenBatch provides an API for bulk addition of tokens; you can  
> subclass Analyzer to exploit that.

Ahh, I get it.  Nice!
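
So something like this, roughly?  (Total sketch -- I'm guessing at
the analyze()/append() names and signatures from your description,
so the real KS API may well differ.)

    package PreTokenizedAnalyzer;
    use strict;
    use warnings;
    use base qw( KinoSearch::Analysis::Analyzer );
    use KinoSearch::Analysis::TokenBatch;

    # Guesswork sketch: push tokens that were split upstream (here,
    # on tabs) straight into a TokenBatch instead of running a
    # Tokenizer.  analyze() and append() are assumed method names.
    sub analyze {
        my ( $self, $field_text ) = @_;
        my $batch  = KinoSearch::Analysis::TokenBatch->new;
        my $offset = 0;
        for my $token_text ( split /\t/, $field_text ) {
            my $start = index( $field_text, $token_text, $offset );
            my $end   = $start + length $token_text;
            $batch->append( $token_text, $start, $end );
            $offset = $end;
        }
        return $batch;
    }

    1;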

Mike
