Re: Lucene and Google Web 1T 5 Gram

Mathieu Lecarme Wed, 23 Apr 2008 06:17:34 -0700

Rafael Turk a écrit :

Hi Folks,


   I´m trying to load Google Web 1T 5 Gram to Lucene. (This corpus contains
English word n-grams and their observed frequency counts. The length of the
n-grams ranges from unigrams(single words) to five-grams)

   I´m loading each ngram (each row is a ngram) as an individual Document.
This way I´ll be able to search for each ngram separated, but I´m ending
with huge indexes witch makes them very hard to load and read the index.

  Is there a better way to load and read ngrams to a Lucene index? Maybe
using lower level api?


More Info about Google Web 1T 5 Gram corpus at:
<http://www.ldc.upenn.edu/Catalog/docs/LDC2006T13/readme.txt>

Thanks,

Rafael


What do you wont to do?

If you wont an ngram => popularity map, just use a berkley DB, and usethis information in your Lucene application. Lucene is a reversed index,Berkeley DB an index.


M.


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Lucene and Google Web 1T 5 Gram

Reply via email to