Hi Rafael,

We initially tried to do the same but ended up developing our own API for
querying the Web 1T. You can find more details at
http://digitalpebble.com/resources.html
There may be a way to reuse parts of Lucene, e.g. only the term index,
but I could not find an obvious way to do that.
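
To give an idea of the kind of lookup that does not need one Document per
n-gram, here is a minimal sketch. It is not our API and not a Lucene API:
the class NgramTable and its methods are hypothetical names made up for
this example. It assumes each input line has the form
"w1 w2 ... wN<TAB>count", and it keeps everything in memory, so it only
illustrates the idea on a small slice of the data.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical in-memory n-gram count table (illustration only). */
class NgramTable {
    private final String[] ngrams; // n-gram strings, sorted lexicographically
    private final long[] counts;   // counts[i] is the frequency of ngrams[i]

    private NgramTable(String[] ngrams, long[] counts) {
        this.ngrams = ngrams;
        this.counts = counts;
    }

    /**
     * Builds a table from lines of the assumed form "w1 w2 ... wN<TAB>count".
     * Malformed lines (no tab, non-numeric count) are not handled here.
     */
    static NgramTable fromLines(List<String> lines) {
        String[][] rows = new String[lines.size()][];
        for (int i = 0; i < rows.length; i++) {
            String line = lines.get(i);
            int tab = line.lastIndexOf('\t');
            rows[i] = new String[] {
                line.substring(0, tab), line.substring(tab + 1).trim()
            };
        }
        // Sort by n-gram text so that lookups can use binary search.
        Arrays.sort(rows, (a, b) -> a[0].compareTo(b[0]));

        String[] keys = new String[rows.length];
        long[] values = new long[rows.length];
        for (int i = 0; i < rows.length; i++) {
            keys[i] = rows[i][0];
            values[i] = Long.parseLong(rows[i][1]);
        }
        return new NgramTable(keys, values);
    }

    /** Returns the stored count for an exact n-gram, or 0 if it is absent. */
    long count(String ngram) {
        int i = Arrays.binarySearch(ngrams, ngram);
        return i >= 0 ? counts[i] : 0L;
    }
}
```

You would build the table once with NgramTable.fromLines(...) and then call
count("some n-gram") for each query. Two parallel arrays cost far less per
entry than a Document with stored fields, but the full corpus would still
need an on-disk variant of the same sorted layout.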

Best,

Julien

-- 
DigitalPebble Ltd
http://www.digitalpebble.com


On 23/04/2008, Rafael Turk <[EMAIL PROTECTED]> wrote:
>
> Hi Folks,
>
>    I'm trying to load Google Web 1T 5 Gram to Lucene. (This corpus
> contains English word n-grams and their observed frequency counts. The
> length of the n-grams ranges from unigrams (single words) to five-grams.)
>
>    I'm loading each ngram (each row is an ngram) as an individual Document.
> This way I'll be able to search for each ngram separately, but I'm ending
> up with huge indexes, which makes them very hard to load and read.
>
>   Is there a better way to load and read ngrams in a Lucene index? Maybe
> using a lower-level API?
>
>
> More Info about Google Web 1T 5 Gram corpus at:
> <http://www.ldc.upenn.edu/Catalog/docs/LDC2006T13/readme.txt>
>
> Thanks,
>
>
> Rafael
>
