Hi,

This is all still very new to me, so apologies if this is not the right
place to ask this question.

I want to take the English trie language model (5.5TB) built from
the Common Crawl data set:

http://data.statmt.org/ngrams/lm/en.trie

I then want to extract every n-gram that contains a certain word, and do
this for a list of 100 words. For example, if I were looking for all n-grams
containing the word "discombobulated", I would want an output file listing
each n-gram that contains that word along with the number of times that
n-gram occurs:

word1 discombobulated 25

word1 discombobulated word3 40
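
To make the intended output concrete, here is a minimal sketch in Python of
the filtering step. It rests on an assumption I have not verified: that the
n-grams are available as plain text, one "n-gram<TAB>count" per line. The
.trie file itself is presumably a binary format, and I do not know whether it
can be dumped into this form. The name filter_ngrams and the sample lines are
made up for illustration.

```python
import io
from collections.abc import Iterable, Iterator


def filter_ngrams(
    lines: Iterable[str], targets: set[str]
) -> Iterator[tuple[str, str, int]]:
    """Yield (target, ngram, count) for each n-gram containing a target word.

    ASSUMPTION: each input line looks like "w1 w2 ... wn<TAB>count".
    Lines that do not match this hypothetical format are skipped.
    """
    for line in lines:
        ngram, sep, count = line.rstrip("\n").rpartition("\t")
        if not sep or not count.isdigit():
            continue  # not in the assumed "ngram<TAB>count" form
        # An n-gram can match several of the target words at once.
        for word in set(ngram.split()) & targets:
            yield word, ngram, int(count)


# Tiny in-memory stand-in for the (hypothetical) plain-text dump.
sample = io.StringIO(
    "word1 discombobulated\t25\n"
    "word1 word2\t7\n"
    "word1 discombobulated word3\t40\n"
)
matches = list(filter_ngrams(sample, {"discombobulated"}))
for word, ngram, count in matches:
    print(f"{ngram} {count}")
```

Because this reads one line at a time, the filter itself needs very little
memory regardless of the input size; whether the data can be obtained in such
a line-oriented form is the open question.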

Given the size of the file, I am keen to get this right the first time.
Could someone give me an example of how this can be done, and would this
kind of query be feasible with 64GB of RAM?

Thanks,

Graeme

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
