Many thanks. Just what I needed.

James
Quoting Philipp Koehn <[email protected]>:

> Hi,
>
> yes, you are missing this here:
> "An Efficient Method for Determining Bilingual Word Classes"
> http://www.aclweb.org/anthology/E99-1010
>
> The numbers in the classes files are not counts, they are word classes.
>
> -phi
>
> On Tue, Sep 1, 2009 at 4:24 PM, James Read <[email protected]> wrote:
>> Hi,
>>
>> Quoting Philipp Koehn <[email protected]>:
>>
>>> Hi,
>>>
>>> the computationally expensive part is the *.classes files.
>>>
>>
>> Again, these files just have two columns: a string in the first column and
>> what looks like a count in the second column. Why should counting strings
>> take so long? Am I missing something here?
>>
>> James
>>
>>> -phi
>>>
>>> On Tue, Sep 1, 2009 at 3:59 PM, James Read <[email protected]> wrote:
>>>>
>>>> Hi,
>>>>
>>>> Quoting Philipp Koehn <[email protected]>:
>>>>
>>>>> Hi,
>>>>>
>>>>> yes, it is correct that step 1 is doing just the data preparation for
>>>>> GIZA++. The most time-consuming step is running mkcls to create the
>>>>> classes for the relative distortion models.
>>>>>
>>>>
>>>> Do you mean the *.vcb files that are created in step 1? These just look
>>>> like dictionary files with three fields: a) a numeric ID, b) the word
>>>> entry, c) the frequency of the string. My make_dictionary function does
>>>> this in about 20 seconds. Why is mkcls taking so long? Is it doing
>>>> something complicated that I have missed here?
>>>>
>>>> James
>>>>
>>>>> -phi
>>>>>
>>>>> On Mon, Aug 31, 2009 at 4:39 PM, James Read <[email protected]> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> does anyone know what step 1 of the Moses training script does other
>>>>>> than produce the dictionaries and the numerical sentences that enable
>>>>>> GIZA++ to do its job? The reason I ask is that on my machine step 1
>>>>>> takes just over 70 minutes for the en-fr Europarl corpus.
>>>>>>
>>>>>> My optimised version of data preparation and EM IBM Model 1 completes
>>>>>> in 121 seconds for five iterations of EM, that's just over 2 minutes.
>>>>>> Before publishing these results I just wanted to make sure there's
>>>>>> nothing I've missed about step 1 of the training process. Does it do
>>>>>> anything at all that influences GIZA++ other than preparing the
>>>>>> digital sentences?
>>>>>>
>>>>>> James
>>>>>>
>>>>>> --
>>>>>> The University of Edinburgh is a charitable body, registered in
>>>>>> Scotland, with registration number SC005336.
>>>>>>
>>>>>> _______________________________________________
>>>>>> Moses-support mailing list
>>>>>> [email protected]
>>>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
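For anyone following the thread: the point of the answer is that the *.classes files are not frequency counts but the output of a clustering optimisation, which is why mkcls dwarfs simple counting. A minimal monolingual sketch of an exchange-style clustering in Python (the function names, the simplified objective, and the toy corpus are illustrative only, not mkcls's actual code; the real tool optimises a bilingual criterion, see the paper linked above):

```python
import math
from collections import Counter

def class_bigram_ll(bigrams, cls):
    """Class-bigram log-likelihood, up to a constant:
       sum_{c1,c2} N(c1,c2) log N(c1,c2) - 2 * sum_c N(c) log N(c)."""
    pair_counts = Counter((cls[a], cls[b]) for a, b in bigrams)
    class_counts = Counter(cls[a] for a, _ in bigrams)
    ll = sum(n * math.log(n) for n in pair_counts.values())
    ll -= 2 * sum(n * math.log(n) for n in class_counts.values())
    return ll

def exchange_clustering(corpus, n_classes, sweeps=5):
    """Greedy exchange algorithm: for each word in turn, try every class
    and keep the assignment that maximises the likelihood. Every candidate
    move re-scores the whole corpus, which is why this kind of clustering
    costs far more than counting word frequencies."""
    bigrams = list(zip(corpus, corpus[1:]))
    words = sorted(set(corpus))
    cls = {w: i % n_classes for i, w in enumerate(words)}
    for _ in range(sweeps):
        for w in words:
            scores = {}
            for k in range(n_classes):
                cls[w] = k
                scores[k] = class_bigram_ll(bigrams, cls)
            cls[w] = max(scores, key=scores.get)  # never decreases the LL
    return cls

# Toy corpus: determiners and nouns alternate, so with two classes the
# likelihood favours grouping {the, a} against {cat, dog}.
corpus = "the cat a dog the dog a cat the cat a dog".split()
classes = exchange_clustering(corpus, n_classes=2)
```

Each sweep is O(V x K) likelihood evaluations, and each evaluation touches every bigram, which scales badly on a Europarl-sized vocabulary; real implementations use incremental count updates instead of full re-scoring, but the optimisation loop is still the dominant cost of step 1.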

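By contrast, the five iterations of IBM Model 1 EM that James benchmarks are cheap per pass. A minimal sketch in Python (the function name and toy bitext are illustrative, not James's actual implementation):

```python
from collections import defaultdict

def train_ibm_model1(bitext, iterations=5):
    """Train IBM Model 1 translation probabilities t(f|e) by EM.

    bitext: list of (source_tokens, target_tokens) sentence pairs.
    """
    # Target-side vocabulary, used for uniform initialisation of t(f|e).
    f_vocab = {f for _, fs in bitext for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)   # expected counts c(e, f)
        total = defaultdict(float)   # expected counts c(e)
        for es, fs in bitext:
            es = ["NULL"] + es       # NULL allows unaligned target words
            for f in fs:
                # E-step: distribute each target word's mass over all
                # source candidates in proportion to current t(f|e).
                z = sum(t[(e, f)] for e in es)
                for e in es:
                    delta = t[(e, f)] / z
                    count[(e, f)] += delta
                    total[e] += delta
        # M-step: re-estimate t(f|e) from the expected counts.
        for (e, f), c in count.items():
            t[(e, f)] = c / total[e]
    return t

# Toy example: EM concentrates probability on co-occurring pairs,
# e.g. "maison" ends up aligned with "house" rather than "the".
bitext = [(["the", "house"], ["la", "maison"]),
          (["the", "book"], ["le", "livre"]),
          (["a", "house"], ["une", "maison"])]
t = train_ibm_model1(bitext)
```

Each iteration is a single pass over the corpus with per-sentence work quadratic only in sentence length, which is consistent with five iterations finishing in minutes while the clustering in mkcls takes far longer.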