Yeah, indexing all n-grams will be a killer. A Lucene index may work with less RAM because it memory-maps its index files, but that still won't be a huge gain if the index is very large.
One way around the problem may be to trim the "long tail" (n-grams with very low frequency). Another that comes to mind is to split the n-grams into N buckets by frequency (most frequent to least frequent) and then build a Bloom filter for each bucket. That way you could check heuristically, for each suggestion, whether it exists in a given frequency bucket, and sort the suggestions crudely by bucket instead of comparing exact counts. Bloom filters can report false positives, so the ordering is only approximate. This assumes your suggestions come from another algorithm and just need to be ranked somehow.

Dawid

On Thu, Jun 12, 2014 at 11:37 AM, Daniel Naber <daniel.na...@languagetool.org> wrote:
> On 2014-06-12 09:03, Dawid Weiss wrote:
>
> Hi Dawid,
>
> thanks for your fast response.
>
>> What's your data and why do you need to cram everything in RAM?
>> Perhaps there's some other options I could recommend?
>
> I'm playing with the Google ngram index. It could be used to improve
> LanguageTool's suggestions, by preferring a suggestion that's more common.
> There's berkeleylm (https://code.google.com/p/berkeleylm/) for very fast
> ngram lookups, but it's also RAM-based. As the ngram index is so huge, it
> means one still needs large amounts of RAM (10GB when using the Web1T
> corpus, according to the berkeleylm paper).
>
> Having a frequency lookup that requires less RAM but is at least not slow
> would be nice. Next thing I'd try is a Lucene index.
>
> Regards
>  Daniel

_______________________________________________
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel
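
P.S. A rough sketch of the bucketed Bloom filter idea from my reply, in case it helps. This is only an illustration under assumptions of mine: the thresholds, filter sizes and all names (BloomFilter, THRESHOLDS, build_buckets, bucket_rank, rank_suggestions) are made up for the example, and the hand-rolled filter stands in for whatever library implementation you would really use. N-grams below the lowest threshold are simply dropped, which is the "long tail" trimming.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Double hashing: derive all bit positions from two halves of one digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


# Hypothetical count thresholds, most frequent bucket first.
THRESHOLDS = [1_000_000, 10_000, 100]


def build_buckets(ngram_counts, num_bits=1 << 16, num_hashes=5):
    """Add each n-gram to the first bucket whose threshold its count reaches."""
    buckets = [BloomFilter(num_bits, num_hashes) for _ in THRESHOLDS]
    for ngram, count in ngram_counts.items():
        for i, threshold in enumerate(THRESHOLDS):
            if count >= threshold:
                buckets[i].add(ngram)
                break
        # Counts below the last threshold fall through: the trimmed long tail.
    return buckets


def bucket_rank(buckets, ngram):
    """Index of the first bucket claiming the n-gram; len(buckets) if none does."""
    for i, bloom in enumerate(buckets):
        if ngram in bloom:
            return i
    return len(buckets)


def rank_suggestions(buckets, suggestions):
    """Stable sort: suggestions found in more frequent buckets come first."""
    return sorted(suggestions, key=lambda s: bucket_rank(buckets, s))
```

The memory cost is then fixed by num_bits per bucket rather than by the size of the n-gram list, at the price of occasional false positives that push a suggestion into a more frequent bucket than it deserves.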