On 2014-06-12 09:03, Dawid Weiss wrote:

Hi Dawid,

thanks for your quick response.

> What's your data and why do you need to cram everything in RAM?
> Perhaps there's some other options I could recommend?

I'm playing with the Google ngram index. It could be used to improve 
LanguageTool's suggestions by preferring the suggestion that is more 
common. There's berkeleylm (https://code.google.com/p/berkeleylm/) for 
very fast ngram lookups, but it's also RAM-based. As the ngram index is 
so huge, one still needs large amounts of RAM (10 GB for the Web1T 
corpus, according to the berkeleylm paper).
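
To make the idea concrete, here's a rough sketch of the ranking I have 
in mind. NgramCounts and its getCount method are made-up placeholders, 
not berkeleylm's or LanguageTool's actual API:

import java.util.Comparator;
import java.util.List;

// Hypothetical lookup interface: returns the corpus frequency of an
// ngram, or 0 if it was never seen.
interface NgramCounts {
  long getCount(String ngram);
}

class SuggestionRanker {
  private final NgramCounts counts;

  SuggestionRanker(NgramCounts counts) {
    this.counts = counts;
  }

  // Re-orders spelling suggestions so the one that forms the most
  // common trigram with the surrounding words comes first.
  List<String> rank(String leftWord, List<String> suggestions, String rightWord) {
    suggestions.sort(Comparator.comparingLong(
        (String s) -> counts.getCount(leftWord + " " + s + " " + rightWord))
        .reversed());
    return suggestions;
  }
}

So for "their" vs. "there" after "over" and before "house", the 
suggestion whose trigram has the higher count would win.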

A frequency lookup that requires less RAM while still being reasonably 
fast would be nice. The next thing I'd try is a Lucene index, maybe 
something like the sketch below.
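
Roughly this, assuming each ngram is indexed as one document with the 
ngram itself in a "ngram" StringField and its frequency in a stored 
"count" field. The field names and index layout are just my assumption, 
nothing that exists yet:

import java.io.File;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class NgramLookup {
  public static void main(String[] args) throws Exception {
    // Open a previously built index; "ngram-index" is a placeholder path.
    try (DirectoryReader reader =
        DirectoryReader.open(FSDirectory.open(new File("ngram-index")))) {
      IndexSearcher searcher = new IndexSearcher(reader);
      // Exact-match lookup of one ngram via a TermQuery.
      TopDocs hits = searcher.search(
          new TermQuery(new Term("ngram", "the quick brown")), 1);
      if (hits.totalHits > 0) {
        Document doc = searcher.doc(hits.scoreDocs[0].doc);
        System.out.println("count: " + doc.get("count"));
      } else {
        System.out.println("ngram not found");
      }
    }
  }
}

That keeps the data on disk and only the term dictionary (plus OS 
caching) in memory, which is exactly the RAM/speed trade-off I'm after. 
Whether the lookups are fast enough in practice is what I'd have to 
measure.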

Regards
  Daniel

