Hi,

The data is sentence-segmented, and phrases are never extracted across
sentence boundaries.

Suppose you train your model on a corpus that contains a single
parallel sentence pair. The sentence has length L on both the source
and the target side, and it is aligned along the diagonal.
If n > L, you cannot extract any phrase of length n from this training
corpus. If n <= L, you can extract L - n + 1 phrases of length n.

Example: for L = 5 you can extract five phrases of length n = 1, four of
length n = 2, ... , one of length n = 5, and none of length n > 5.
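
To make the counting concrete, here is a minimal Python sketch of the
diagonal case. The function names are made up for illustration; the code
only encodes the L - n + 1 rule and ignores everything else (alignment
gaps, count cutoffs, max. phrase length).

```python
def phrases_of_length(L, n):
    """Number of length-n phrases in one diagonally aligned sentence of length L."""
    return max(0, L - n + 1)


def total_ngram_slots(sentence_lengths, n):
    """Sum of the per-sentence counts: an upper bound on the number of
    length-n phrase occurrences (tokens, not distinct types)."""
    return sum(phrases_of_length(L, n) for L in sentence_lengths)


# The example from the text: L = 5, n = 1..6
print([phrases_of_length(5, n) for n in range(1, 7)])  # [5, 4, 3, 2, 1, 0]
```

Summed over a whole corpus, this bound can only shrink as n grows, since
every sentence contributes fewer (or zero) positions for longer n-grams.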


Also, bilingual blocks are valid (= extractable) phrases only if they are
consistent with the word alignment. Larger blocks span more alignment
links and may therefore be inconsistent more often.
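
A rough sketch of the consistency check, in Python. This is a simplified
illustration and not the actual Moses extraction code: it takes only the
tightest target block for each source span, it does not extend over
unaligned target words, and the names (consistent_source_spans, max_len)
are hypothetical.

```python
from collections import Counter


def consistent_source_spans(src_len, alignment, max_len=7):
    """Return source spans (start, end), end inclusive, whose tightest
    target block is consistent with the word alignment.

    alignment is a set of (source_index, target_index) links.
    """
    spans = []
    for i1 in range(src_len):
        for i2 in range(i1, min(src_len, i1 + max_len)):
            # Target positions linked to any source word inside the span.
            tgt = [j for (i, j) in alignment if i1 <= i <= i2]
            if not tgt:
                continue  # require at least one alignment link
            j1, j2 = min(tgt), max(tgt)
            if j2 - j1 + 1 > max_len:
                continue
            # Consistent: no target word in [j1, j2] links outside [i1, i2].
            if all(i1 <= i <= i2 for (i, j) in alignment if j1 <= j <= j2):
                spans.append((i1, i2))
    return spans


def counts_by_length(spans):
    """Histogram: span length -> number of consistent source spans."""
    return Counter(i2 - i1 + 1 for (i1, i2) in spans)


diagonal = {(k, k) for k in range(5)}
crossing = {(0, 0), (1, 3), (2, 2), (3, 1), (4, 4)}
print(counts_by_length(consistent_source_spans(5, diagonal)))
print(counts_by_length(consistent_source_spans(5, crossing)))
```

With the diagonal alignment every span is consistent, so the counts follow
the L - n + 1 rule. With the crossing alignment some spans drop out, for
instance the span covering source words 0 and 1.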


Of course you should consider some more aspects, e.g.:

- training settings 
  (there won't be any 8-grams if you set the max. phrase length to 7; 
  long phrases will be affected more by a count cutoff because of sparsity)
- vocabulary sizes limit the number of possible combinations
- n-gram entropy of the language 
  [http://languagelog.ldc.upenn.edu/myl/Shannon1950.pdf]


Analyzing such things in detail is surely a fun pastime. You can start
with, for example:

- vocabulary sizes
- number of running words in your corpus
- histograms of source-side training sentence lengths
- number of distinct n-grams that appear in the source side of the corpus
  vs. number of distinct n-grams that are source sides of valid phrases
- number of distinct n-grams that appear in the source side of the corpus
  if you undo the sentence segmentation (replace all line breaks by
  spaces)
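
One of these comparisons, distinct n-grams with and without sentence
segmentation, could be sketched in Python as follows. The function names
are made up for illustration, and whitespace tokenization is an
assumption; in practice you would read the lines from the source side of
your training corpus.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distinct_ngrams_segmented(lines, n):
    """Distinct n-grams that do not cross a sentence (line) boundary."""
    seen = set()
    for line in lines:
        seen.update(ngrams(line.split(), n))
    return len(seen)


def distinct_ngrams_unsegmented(lines, n):
    """Distinct n-grams after replacing all line breaks by spaces."""
    tokens = " ".join(lines).split()
    return len(set(ngrams(tokens, n)))


# Toy corpus; replace with the lines of your own source-side file.
lines = ["a b c", "d e"]
for n in range(1, 4):
    print(n, distinct_ngrams_segmented(lines, n),
          distinct_ngrams_unsegmented(lines, n))
```

The gap between the two numbers for larger n shows how much the sentence
boundaries alone restrict the long n-grams.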

Cheers,
Matthias



On Thu, 2015-01-15 at 16:39 +0000, Read, James C wrote:
> Hi,
> 
> 
> 
> I just ran a count of different sized n-grams in the source side of my
> phrase table and this is what I got.
> 
> unigrams      85,233
> bigrams      991,701
> trigrams   2,697,341
> 4-grams    3,876,180
> 5-grams    4,209,094
> 6-grams    3,702,813
> 7-grams    2,560,251
> 8-grams            0
> 
> So, up until the 5-grams the results are what I expected: the number is
> increasing. But then it drops for the 6-grams and drops again for the
> 7-grams.
> 
> 
> 
> Does anybody know why?
> 
> 
> 
> James 
> 
> 
> 
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support



-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
