I've observed this as well. It seems to me there are two competing 
pressures affecting the number of ngram types in a corpus. On the one hand, as 
the size of the corpus increases, so does the vocabulary. This obviously 
increases the number of unigram types (which is the same as the vocabulary 
size), but it also increases the number of types at every other ngram size. On 
the other hand, language is hugely constrained by context, and the longer the 
context (i.e. the longer the ngram), the less freedom there is in what you can 
reasonably say next. If I say "the big", there are lots of reasonable choices 
for the third word, but if I say "I was frightened by the barking of the big", 
there are very few sensible completions.
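
A rough way to see both pressures at once is to count, for each ngram size, 
the number of distinct types and the average number of distinct words that 
follow each context. The sketch below is only an illustration: the toy corpus 
and the function names (ngram_types, mean_continuations) are made up here and 
do not come from Moses or from the phrase table discussed in this thread.

```python
from collections import defaultdict


def ngram_types(tokens, n):
    """Return the set of distinct n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def mean_continuations(tokens, n):
    """Average number of distinct next words per (n-1)-word context."""
    followers = defaultdict(set)
    for gram in ngram_types(tokens, n):
        # Split each n-gram into its context and its final word.
        followers[gram[:-1]].add(gram[-1])
    if not followers:
        return 0.0
    return sum(len(words) for words in followers.values()) / len(followers)


# Toy corpus, invented for illustration only (an assumption, not real data).
tokens = "the big dog barked . the big cat slept . the big dog slept .".split()

for n in range(1, 5):
    print(n, len(ngram_types(tokens, n)), round(mean_continuations(tokens, n), 2))
```

On a real corpus I would expect the first number to grow with n for a while 
and the second to shrink towards 1, but that is a guess to be checked, not a 
result.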

You could quantify this by computing perplexity at various ngram sizes, but 
that's just another way of measuring the same effect you see with your ngram 
counts.
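
For what it's worth, here is a minimal sketch of that perplexity measurement. 
It assumes an unsmoothed maximum-likelihood model scored on its own training 
data, which is far too optimistic for real use (a proper measurement would use 
held-out text and a smoothed model). The function name train_perplexity and 
the toy corpus are hypothetical.

```python
import math
from collections import Counter


def train_perplexity(tokens, n):
    """Perplexity of an unsmoothed order-n MLE model on its own training data."""
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    total = sum(grams.values())
    if total == 0:
        raise ValueError("corpus is shorter than n")
    # Count how often each (n-1)-word context occurs as the start of an n-gram.
    contexts = Counter()
    for gram, count in grams.items():
        contexts[gram[:-1]] += count
    # Sum of log2 P(word | context) over every n-gram token in the corpus.
    log_prob = sum(
        count * math.log2(count / contexts[gram[:-1]])
        for gram, count in grams.items()
    )
    return 2 ** (-log_prob / total)


# Toy corpus, invented for illustration only (an assumption, not real data).
tokens = "the big dog barked . the big cat slept . the big dog slept .".split()

for n in range(1, 5):
    print(n, round(train_perplexity(tokens, n), 3))
```

Lower perplexity at larger n means the context leaves fewer choices for the 
next word, which is the same constraint the ngram type counts reflect.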

Of course this could be complete nonsense - I'm eager to hear what other people 
think.

- John Burger
 MITRE

On Jan 15, 2015, at 11:39, Read, James C <[email protected]> wrote:

> Hi,
> 
> I just ran a count of different sized n-grams in the source side of my phrase 
> table and this is what I got.
> 
> unigrams      85,233
> bigrams      991,701
> trigrams   2,697,341
> 4-grams    3,876,180
> 5-grams    4,209,094
> 6-grams    3,702,813
> 7-grams    2,560,251
> 8-grams            0
> 
> So, up to the 5-grams the results are what I expected: the number keeps 
> increasing. But then it drops for the 6-grams and drops again for the 7-grams.
> 
> Does anybody know why?
> 
> James 
> 
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support

