Re: Autocompletion on large index

Michael McCandless Wed, 06 Jul 2011 11:40:02 -0700

Hmm... so I suspect the fst suggest module must first gather up all
titles, then sort them, in RAM, and then build the actual FST.  Maybe
it's this gather + sort that's taking so much RAM?


1.3 M publications times 100 chars times 2 bytes/char = ~248 MB.  So
that shouldn't be it...

Is this a an accessible corpus?  Can I somehow get a copy to play with...?

Are you able to [temporarily, once] build the full FST and other
suggest impls and compare how much RAM is required for building and
then lookups?

Mike McCandless

http://blog.mikemccandless.com

On Wed, Jul 6, 2011 at 1:50 PM, Elmer <[email protected]> wrote:
> Hi Mike,
>
> That's what I thought when I started indexing it. To be clear, it happens on
> build time.
> I don't know if memory efficiency is better when building has finished.
>
> The titles I index are titles from the dblp computer sience bibliography.
> They can take up to... say 100 characters.
> Examples:
> -------
> - Auditory stimulus optimization with feedback from fuzzy clustering of
> neuronal responses
> - Two-objective method for crisp and fuzzy interval comparison in
> optimization
> - Bound Constrained Smooth Optimization for Solving Variational Inequalities
> and Related Problems
> - Retrieval of bibliographic records using Apache Lucene
> - Digital Library Information Appliances
> -------
>
> The "title_suggest" field uses the KeyWordTokenizer and LowerCaseFilter in
> that order.
>
> I also tried to do the same for the author names, and this works without
> problems. Actually it builds the tree/fsa/... faster from dictionary than
> from file (the lookup data file that can be stored and loaded through the
> .store and .load methods). But the larger set of publication titles is
> currently no-go with 2.5GB of heapspace, only having a main class that
> builds the LookUp data.
>
> BR,
> Elmer
>
>
> -----Oorspronkelijk bericht----- From: Michael McCandless
> Sent: Wednesday, July 06, 2011 6:23 PM
> To: [email protected]
> Subject: Re: Autocompletion on large index
>
> You could try storing your autocomplete index in a RAMDirectory?
>
> But: I'm surprised you see the FST suggest impl using up so much RAM;
> very low memory usage is one of the strengths of the FST approach.
> Can you share the text (titles) you are feeding to the suggest module?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Wed, Jul 6, 2011 at 12:08 PM, Elmer <[email protected]> wrote:
>>
>> Hi again.
>>
>> I have created my own autocompleter based on the spellchecker. This
>> works well in a sense that it is able to create an auto completion index
>> from my 'publication' index. However, integrated in my web application,
>> each keypress asks autocompleter to search the index, which is stored on
>> disk (not in mem), just like spellchecker does (except that spellchecker
>> is not invoked every keypress).
>> With Lucene 3.3.0, auto completion modules are included, which load
>> their trees/fsa/... in memory. I'd like to use these modules, but the
>> problem is that they use more than 2.5GB, causing heap space exceptions.
>> This happens when I try to build a LookUp index (fst,jaspell or tst,
>> doesn't matter) from my 'publication' index consisting of 1.3M
>> publications. The field I use for autocompletion holds the titles of the
>> publications indexed untokenized (but lowercased).
>>
>> Code:
>> Lookup autoCompleter = new TSTLookup();
>> FSDirectory dir = FSDirectory.open(new File("PATHTOINDEX"));
>> LuceneDictionary dict = new
>> LuceneDictionary(IndexReader.open(dir),"title_suggest");
>> autoCompleter.build(dict);
>>
>> Is it possible to have the autocompletion module to work in-memory on
>> such a dataset without increasing java's heapspace?
>> FTR, the 3.3.0 autocompletion modules use more than 2.5GB of RAM, where
>> my own autocompleter index is stored on disk using about 300MB.
>>
>> BR,
>> Elmer
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: Autocompletion on large index

Reply via email to