Hmm... so I suspect the fst suggest module must first gather up all titles, then sort them, in RAM, and then build the actual FST. Maybe it's this gather + sort that's taking so much RAM?
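One rough way to check that guess is to measure the gather + sort step on its own, outside the suggest module. The sketch below uses only plain JDK collections and none of Lucene's classes; the class and method names (`TitleMemoryEstimate`, `gatherAndSort`, `estimateCharBytes`) are made up for illustration. The estimate counts 2 bytes per char only, so it is a lower bound that ignores per-object overhead.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Lucene: mimics "gather all titles, sort in RAM".
public class TitleMemoryEstimate {

    /** Copies every title into one in-memory list and sorts it. */
    public static List<String> gatherAndSort(Iterable<String> titles) {
        List<String> all = new ArrayList<String>();
        for (String t : titles) {
            all.add(t);
        }
        Collections.sort(all);
        return all;
    }

    /** Lower bound on the char payload: 2 bytes per char, no object overhead. */
    public static long estimateCharBytes(List<String> titles) {
        long bytes = 0;
        for (String t : titles) {
            bytes += 2L * t.length();
        }
        return bytes;
    }

    public static void main(String[] args) {
        List<String> sample = new ArrayList<String>();
        sample.add("retrieval of bibliographic records using apache lucene");
        sample.add("digital library information appliances");

        List<String> sorted = gatherAndSort(sample);
        System.out.println("first after sort: " + sorted.get(0));
        System.out.println("char payload bytes: " + estimateCharBytes(sorted));

        // Same back-of-envelope figure for the full corpus: 1.3M titles x 100 chars.
        long full = 1300000L * 100L * 2L;
        System.out.println("full corpus, MB: " + Math.round(full / (1024.0 * 1024.0)));
    }
}
```

Feeding the real titles through `gatherAndSort` and watching heap usage would show whether this step alone comes anywhere near the heap limit.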
1.3 M publications times 100 chars times 2 bytes/char = ~248 MB. So that shouldn't be it...

Is this an accessible corpus? Can I somehow get a copy to play with...?

Are you able to [temporarily, once] build the full FST and other suggest impls and compare how much RAM is required for building and then lookups?

Mike McCandless

http://blog.mikemccandless.com

On Wed, Jul 6, 2011 at 1:50 PM, Elmer <evanchaste...@gmail.com> wrote:
> Hi Mike,
>
> That's what I thought when I started indexing it. To be clear, it happens at
> build time.
> I don't know if memory efficiency is better when building has finished.
>
> The titles I index are titles from the dblp computer science bibliography.
> They can take up to... say 100 characters.
> Examples:
> -------
> - Auditory stimulus optimization with feedback from fuzzy clustering of
> neuronal responses
> - Two-objective method for crisp and fuzzy interval comparison in
> optimization
> - Bound Constrained Smooth Optimization for Solving Variational Inequalities
> and Related Problems
> - Retrieval of bibliographic records using Apache Lucene
> - Digital Library Information Appliances
> -------
>
> The "title_suggest" field uses the KeywordTokenizer and LowerCaseFilter in
> that order.
>
> I also tried to do the same for the author names, and this works without
> problems. Actually it builds the tree/fsa/... faster from the dictionary than
> from file (the lookup data file that can be stored and loaded through the
> .store and .load methods). But the larger set of publication titles is
> currently a no-go with 2.5GB of heap space, with only a main class that
> builds the Lookup data.
>
> BR,
> Elmer
>
>
> -----Original Message-----
> From: Michael McCandless
> Sent: Wednesday, July 06, 2011 6:23 PM
> To: java-user@lucene.apache.org
> Subject: Re: Autocompletion on large index
>
> You could try storing your autocomplete index in a RAMDirectory?
>
> But: I'm surprised you see the FST suggest impl using up so much RAM;
> very low memory usage is one of the strengths of the FST approach.
> Can you share the text (titles) you are feeding to the suggest module?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Wed, Jul 6, 2011 at 12:08 PM, Elmer <evanchaste...@gmail.com> wrote:
>>
>> Hi again.
>>
>> I have created my own autocompleter based on the spellchecker. This
>> works well in the sense that it is able to create an autocompletion index
>> from my 'publication' index. However, integrated in my web application,
>> each keypress asks the autocompleter to search the index, which is stored on
>> disk (not in memory), just like the spellchecker does (except that the
>> spellchecker is not invoked on every keypress).
>> With Lucene 3.3.0, autocompletion modules are included, which load
>> their trees/fsa/... in memory. I'd like to use these modules, but the
>> problem is that they use more than 2.5GB, causing heap space exceptions.
>> This happens when I try to build a Lookup index (fst, jaspell or tst,
>> doesn't matter) from my 'publication' index consisting of 1.3M
>> publications. The field I use for autocompletion holds the titles of the
>> publications indexed untokenized (but lowercased).
>>
>> Code:
>> Lookup autoCompleter = new TSTLookup();
>> FSDirectory dir = FSDirectory.open(new File("PATHTOINDEX"));
>> LuceneDictionary dict = new LuceneDictionary(IndexReader.open(dir), "title_suggest");
>> autoCompleter.build(dict);
>>
>> Is it possible to have the autocompletion module work in-memory on
>> such a dataset without increasing Java's heap space?
>> FTR, the 3.3.0 autocompletion modules use more than 2.5GB of RAM, whereas
>> my own autocompleter index is stored on disk using about 300MB.
>>
>> BR,
>> Elmer
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org