Ok. Thanks. I think I understand this now. I also think I have found the bug in the code which was causing the dodgy output.
So, in conclusion, would you say that a good automated check to see if the code is working correctly would be to add up the probabilities at the end of the EM iterations and check that the probabilities add up to 1 (or slightly less)?

James

Quoting Philipp Koehn <[email protected]>:

> Hi,
>
> because the final loop in each iteration is:
>
>   // estimate probabilities
>   for all foreign words f
>     for all English words e
>       t(e|f) = count(e|f) / total(f)
>
> As I said, there are two normalizations: one on the
> sentence level, the other on the corpus level.
>
> -phi
>
> On Mon, Jul 27, 2009 at 10:30 PM, James Read <[email protected]> wrote:
>> In that case I really don't see how the code is guaranteed to give
>> results which add up to 1.
>>
>> Quoting Philipp Koehn <[email protected]>:
>>
>>> Hi,
>>>
>>> this is LaTeX {algorithmic} code.
>>>
>>>   count($e|f$) += $\frac{t(e|f)}{\text{s-total}(e)}$
>>>
>>> means
>>>
>>>   count(e|f) += t(e|f) / s-total(e)
>>>
>>> So, you got that right.
>>>
>>> -phi
>>>
>>> On Mon, Jul 27, 2009 at 10:18 PM, James Read <[email protected]> wrote:
>>>>
>>>> Hi,
>>>>
>>>> this seems to be pretty much what I implemented. What exactly do you
>>>> mean by these three lines?
>>>>
>>>>   \STATE count($e|f$) += $\frac{t(e|f)}{\text{s-total}(e)}$
>>>>   \STATE total($f$) += $\frac{t(e|f)}{\text{s-total}(e)}$
>>>>   \STATE $t(e|f)$ = $\frac{\text{count}(e|f)}{\text{total}(f)}$
>>>>
>>>> What do you mean by $\frac? The pseudocode I was using shows these
>>>> lines as a simple division, and this is what my code does, i.e.
>>>>
>>>>   t(e|f) = count(e|f) / total(f)
>>>>
>>>> In C code, something like:
>>>>
>>>>   for ( f = 0; f < size_source; f++ )
>>>>   {
>>>>       for ( e = 0; e < size_target; e++ )
>>>>       {
>>>>           t[f][e] = count[f][e] / total[f];
>>>>       }
>>>>   }
>>>>
>>>> Is this the kind of thing you mean?
>>>>
>>>> Thanks
>>>> James
>>>>
>>>> Quoting Philipp Koehn <[email protected]>:
>>>>
>>>>> Hi,
>>>>>
>>>>> I think there was a flaw in some versions of the pseudocode.
>>>>> The probabilities certainly need to add up to one. There are
>>>>> two normalizations going on in the algorithm: one on the sentence
>>>>> level (so the probabilities of all alignments add up to one) and
>>>>> one on the word level.
>>>>>
>>>>> Here is the most recent version:
>>>>>
>>>>> \REQUIRE set of sentence pairs $(\text{\bf e},\text{\bf f})$
>>>>> \ENSURE translation prob. $t(e|f)$
>>>>> \STATE initialize $t(e|f)$ uniformly
>>>>> \WHILE{not converged}
>>>>>   \STATE \COMMENT{initialize}
>>>>>   \STATE count($e|f$) = 0 {\bf for all} $e,f$
>>>>>   \STATE total($f$) = 0 {\bf for all} $f$
>>>>>   \FORALL{sentence pairs ({\bf e},{\bf f})}
>>>>>     \STATE \COMMENT{compute normalization}
>>>>>     \FORALL{words $e$ in {\bf e}}
>>>>>       \STATE s-total($e$) = 0
>>>>>       \FORALL{words $f$ in {\bf f}}
>>>>>         \STATE s-total($e$) += $t(e|f)$
>>>>>       \ENDFOR
>>>>>     \ENDFOR
>>>>>     \STATE \COMMENT{collect counts}
>>>>>     \FORALL{words $e$ in {\bf e}}
>>>>>       \FORALL{words $f$ in {\bf f}}
>>>>>         \STATE count($e|f$) += $\frac{t(e|f)}{\text{s-total}(e)}$
>>>>>         \STATE total($f$) += $\frac{t(e|f)}{\text{s-total}(e)}$
>>>>>       \ENDFOR
>>>>>     \ENDFOR
>>>>>   \ENDFOR
>>>>>   \STATE \COMMENT{estimate probabilities}
>>>>>   \FORALL{foreign words $f$}
>>>>>     \FORALL{English words $e$}
>>>>>       \STATE $t(e|f)$ = $\frac{\text{count}(e|f)}{\text{total}(f)}$
>>>>>     \ENDFOR
>>>>>   \ENDFOR
>>>>> \ENDWHILE
>>>>>
>>>>> -phi
>>>>>
>>>>> On Sun, Jul 26, 2009 at 5:24 PM, James Read <[email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I have implemented the EM Model 1 algorithm as outlined in Koehn's
>>>>>> lecture notes. I was surprised to find that the raw output of the
>>>>>> algorithm gives a translation table in which, for any particular
>>>>>> source word, the sum of the probabilities over all possible target
>>>>>> words is far greater than 1.
>>>>>>
>>>>>> Is this normal?
>>>>>>
>>>>>> Thanks
>>>>>> James
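
For reference, here is a minimal C sketch of one EM iteration that follows the pseudocode above. The corpus layout and all names in it (F, E, MAX_LEN, src, tgt, src_len, tgt_len, num_pairs) are assumptions made up for illustration, not Moses code; the point is only to show where the two normalizations happen: s_total(e) inside each sentence pair, and total(f) over the whole corpus.

#include <stdio.h>

/* Toy-sized sketch; vocabulary sizes and corpus layout are hypothetical. */
#define F        4     /* number of source (foreign) word types   */
#define E        4     /* number of target (English) word types   */
#define MAX_LEN  8     /* maximum sentence length in the toy corpus */

/* t[f][e] holds t(e|f); one call performs one EM iteration of Model 1.
 * src[s]/tgt[s] are the word indices of sentence pair s,
 * src_len[s]/tgt_len[s] their lengths. */
void em_iteration(int num_pairs,
                  const int src_len[], int src[][MAX_LEN],
                  const int tgt_len[], int tgt[][MAX_LEN],
                  double t[F][E])
{
    double count[F][E] = {{0.0}};   /* count(e|f) */
    double total[F]    = {0.0};     /* total(f)   */
    double s_total[E];              /* s-total(e), per sentence pair */

    for (int s = 0; s < num_pairs; s++) {
        /* compute normalization: s-total(e) = sum of t(e|f) over f in this sentence */
        for (int j = 0; j < tgt_len[s]; j++) {
            int e = tgt[s][j];
            s_total[e] = 0.0;
            for (int i = 0; i < src_len[s]; i++)
                s_total[e] += t[src[s][i]][e];
        }
        /* collect counts (sentence-level normalization by s-total(e)) */
        for (int j = 0; j < tgt_len[s]; j++) {
            int e = tgt[s][j];
            for (int i = 0; i < src_len[s]; i++) {
                int f = src[s][i];
                count[f][e] += t[f][e] / s_total[e];
                total[f]    += t[f][e] / s_total[e];
            }
        }
    }

    /* estimate probabilities (corpus-level normalization by total(f));
     * this final division is what makes the sum of t(e|f) over e equal 1 */
    for (int f = 0; f < F; f++)
        for (int e = 0; e < E; e++)
            if (total[f] > 0.0)
                t[f][e] = count[f][e] / total[f];
}

Source words that never occur in the corpus keep their initial uniform values, which also sum to 1 if t(e|f) is initialized to 1/E.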

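The automated check proposed at the top of the thread then amounts to summing each row of the table: for every source word f, the values t(e|f) over all e should come to 1 up to floating-point rounding, which is why a small tolerance is used instead of an exact comparison. Again a sketch with hypothetical names:

#include <stdio.h>
#include <math.h>

/* Returns 1 if, for every source word f, the sum of t(e|f) over all e is
 * within eps of 1; otherwise prints the offending rows and returns 0. */
int check_normalization(int size_source, int size_target,
                        double t[size_source][size_target], double eps)
{
    int ok = 1;
    for (int f = 0; f < size_source; f++) {
        double sum = 0.0;
        for (int e = 0; e < size_target; e++)
            sum += t[f][e];                     /* sum of t(e|f) over e */
        if (fabs(sum - 1.0) > eps) {
            fprintf(stderr, "t(.|f=%d) sums to %f, not 1\n", f, sum);
            ok = 0;
        }
    }
    return ok;
}

With the arrays from the sketch above, this would be called once after the EM loop finishes, for example as check_normalization(F, E, t, 1e-6).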