I've observed this as well. It seems to me there are two competing pressures affecting the number of ngram types in a corpus. On the one hand, as the size of the corpus increases, so does the vocabulary. This obviously increases the number of unigram types (which is the same as the vocabulary size), but it also increases the number of types at every other ngram length. On the other hand, language is hugely constrained by context, and the longer the context (i.e. the longer the ngram), the less freedom there is in what can reasonably come next. If I say "the big", there are lots of reasonable choices for the third word, but if I say "I was frightened by the barking of the big", there are very few sensible completions.
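Here is a rough way to look at both effects on your own data. This is only a minimal sketch in Python: the function names (`ngram_type_counts`, `mean_branching`) and the toy `SENTENCES` list are made up for illustration, tokenization is plain whitespace splitting, and none of it is how Moses builds or counts its phrase table. The first function counts distinct ngram types per length; the second measures how many different next words are actually observed after each context of a given length.

```python
from collections import defaultdict

# Hypothetical toy corpus; substitute your own tokenized source sentences.
SENTENCES = [
    "the big dog barked",
    "the big cat slept",
    "the small dog slept",
]


def ngram_type_counts(sentences, max_n):
    """Return {n: number of distinct n-grams} for n = 1..max_n."""
    types = {n: set() for n in range(1, max_n + 1)}
    for sentence in sentences:
        tokens = sentence.split()
        for n in range(1, max_n + 1):
            # A sentence of length L contains L - n + 1 n-gram positions.
            for i in range(len(tokens) - n + 1):
                types[n].add(tuple(tokens[i:i + n]))
    return {n: len(seen) for n, seen in types.items()}


def mean_branching(sentences, context_len):
    """Average number of distinct next words seen after each context
    of `context_len` words (contexts with no following word are skipped)."""
    followers = defaultdict(set)
    for sentence in sentences:
        tokens = sentence.split()
        for i in range(len(tokens) - context_len):
            context = tuple(tokens[i:i + context_len])
            followers[context].add(tokens[i + context_len])
    if not followers:
        return 0.0
    return sum(len(words) for words in followers.values()) / len(followers)


if __name__ == "__main__":
    print(ngram_type_counts(SENTENCES, 4))
    for k in (1, 2, 3):
        print(k, mean_branching(SENTENCES, k))
```

On this toy corpus the mean number of observed continuations falls from 1.6 after one word of context to 1.2 after two and 1.0 after three. That is the narrowing I mean, although three sentences are far too few to say anything about real data.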
You could quantify this by computing perplexity at various ngram sizes, but that's just another way of measuring the same effect you see with your ngram counts.

Of course this could be complete nonsense - I'm eager to hear what other people think.

- John Burger
  MITRE

On Jan 15, 2015, at 11:39 , Read, James C <[email protected]> wrote:

> Hi,
>
> I just ran a count of different sized n-grams in the source side of my phrase
> table and this is what I got.
>
> unigrams      85,233
> bigrams      991,701
> trigrams   2,697,341
> 4-grams    3,876,180
> 5-grams    4,209,094
> 6-grams    3,702,813
> 7-grams    2,560,251
> 8-grams            0
>
> So, up until the 5-grams the results are what I expected the number is
> increasing. But then it drops for the 6-grams and drops again for the 7-grams.
>
> Does anybody know why?
>
> James
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
