Hi,
The data is sentence-segmented. Suppose you train your model on a training corpus that contains a single parallel sentence pair. The training sentence has length L on both the source and the target side, and it is aligned along the diagonal. If n > L, you cannot extract any phrase of length n from this training corpus. If n <= L, you can extract L - n + 1 phrases of length n.

Example: for L = 5 you can extract five phrases of length n = 1, four of length n = 2, ..., one of length n = 5, and none of length n > 5.

Also, bilingual blocks are valid (= extractable) phrases only if they are consistent with the word alignment. Larger blocks are likely to be inconsistent more often.

Of course you should consider some more aspects, e.g.:
- training settings (there won't be any 8-grams if you set the max. phrase length to 7; long phrases will be affected more by a count cutoff because of sparsity)
- vocabulary sizes, which limit the number of possible combinations
- n-gram entropy of the language [http://languagelog.ldc.upenn.edu/myl/Shannon1950.pdf]

Analyzing such things in detail is surely a fun pastime. You can start with:
- vocabulary sizes
- number of running words of your corpus
- histograms of source-side training sentence lengths
- number of distinct n-grams that appear in the source side of the corpus vs. number of distinct n-grams that are source sides of valid phrases
- number of distinct n-grams that appear in the source side of the corpus if you undo the sentence segmentation (replace all line breaks by spaces)
- etc.

Cheers,
Matthias

On Thu, 2015-01-15 at 16:39 +0000, Read, James C wrote:
> Hi,
>
> I just ran a count of different sized n-grams in the source side of my
> phrase table and this is what I got.
>
> unigrams       85,233
> bigrams       991,701
> trigrams    2,697,341
> 4-grams     3,876,180
> 5-grams     4,209,094
> 6-grams     3,702,813
> 7-grams     2,560,251
> 8-grams             0
>
> So, up until the 5-grams the results are what I expected the number is
> increasing. But then it drops for the 6-grams and drops again for the
> 7-grams.
>
> Does anybody know why?
>
> James
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support

--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
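
P.S. The counting argument in the reply above (L - n + 1 phrases of length n from a sentence pair of length L aligned along the diagonal) and the suggested count of distinct source-side n-grams can be tried out with a few lines of Python. This is only a minimal sketch, not part of Moses: the function names diagonal_phrase_count and distinct_ngrams_per_order are made up for illustration, tokens are assumed to be whitespace-separated, and n-grams are not allowed to cross sentence boundaries.

```python
def diagonal_phrase_count(sentence_length, n):
    """Number of length-n phrases from one sentence pair of the given
    length aligned along the diagonal: L - n + 1, or 0 if n > L."""
    return max(0, sentence_length - n + 1)


def distinct_ngrams_per_order(sentences, max_n):
    """Count distinct n-grams for n = 1..max_n.

    Assumption: each sentence is a whitespace-tokenized string, and
    n-grams never span two sentences.
    """
    seen = {n: set() for n in range(1, max_n + 1)}
    for sentence in sentences:
        tokens = sentence.split()
        for n in range(1, max_n + 1):
            # range() is empty when the sentence is shorter than n
            for i in range(len(tokens) - n + 1):
                seen[n].add(tuple(tokens[i:i + n]))
    return {n: len(grams) for n, grams in seen.items()}


if __name__ == "__main__":
    # The L = 5 example from the reply: 5, 4, 3, 2, 1 phrases, then none.
    print([diagonal_phrase_count(5, n) for n in range(1, 7)])
    # -> [5, 4, 3, 2, 1, 0]

    # Toy one-sentence corpus over a two-word vocabulary.
    print(distinct_ngrams_per_order(["a b a a b b"], 7))
    # -> {1: 2, 2: 4, 3: 4, 4: 3, 5: 2, 6: 1, 7: 0}
```

In the toy corpus the number of distinct n-grams first grows (more combinations are possible than there are words in the vocabulary) and then shrinks (the sentence only has room for L - n + 1 windows of length n). To undo the sentence segmentation, as suggested above, you could pass the whole corpus as a single string joined by spaces and compare the counts.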
