Hi,

yes, you are missing this here: "An Efficient Method for Determining Bilingual Word Classes"
http://www.aclweb.org/anthology/E99-1010

The numbers in the classes files are not counts, they are word classes.

-phi

On Tue, Sep 1, 2009 at 4:24 PM, James Read <[email protected]> wrote:
> Hi,
>
> Quoting Philipp Koehn <[email protected]>:
>
>> Hi,
>>
>> the computationally expensive part is the *.classes files.
>>
> Again, these files just have two columns: a string in the first column and
> what looks like a count in the second column. Why should counting strings
> take so long? Am I missing something here?
>
> James
>
>> -phi
>>
>> On Tue, Sep 1, 2009 at 3:59 PM, James Read <[email protected]> wrote:
>>> Hi,
>>>
>>> Quoting Philipp Koehn <[email protected]>:
>>>
>>>> Hi,
>>>>
>>>> yes, it is correct that step 1 is doing just the data preparation for
>>>> GIZA++. The most time-consuming step is running mkcls to create the
>>>> classes for the relative distortion models.
>>>>
>>> Do you mean the *.vcb files that are created in step 1? These just look
>>> like dictionary files with three fields: a) a numeric ID, b) the word
>>> entry, c) the frequency of the string. My make_dictionary function does
>>> this in about 20 seconds. Why is mkcls taking so long? Is it doing
>>> something complicated that I have missed here?
>>>
>>> James
>>>
>>>> -phi
>>>>
>>>> On Mon, Aug 31, 2009 at 4:39 PM, James Read <[email protected]> wrote:
>>>>> Hi,
>>>>>
>>>>> does anyone know what step 1 of the Moses training script does other
>>>>> than produce the dictionaries and the numerical sentences that enable
>>>>> GIZA++ to do its job? The reason I ask is that on my machine step 1
>>>>> takes just over 70 minutes for the en-fr Europarl corpus.
>>>>>
>>>>> My optimised version of data preparation and EM IBM Model 1 completes
>>>>> in 121 seconds for five iterations of EM, that's just over 2 minutes.
>>>>> Before publishing these results I just wanted to make sure there's
>>>>> nothing I've missed about step 1 of the training process. Does it do
>>>>> anything at all that influences GIZA++ other than preparing the
>>>>> digital sentences?
>>>>>
>>>>> James
>>>>>
>>>>> --
>>>>> The University of Edinburgh is a charitable body, registered in
>>>>> Scotland, with registration number SC005336.
>>>>>
>>>>> _______________________________________________
>>>>> Moses-support mailing list
>>>>> [email protected]
>>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
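[Editor's note: the two file formats under discussion can be sketched as follows. The column layouts are taken from the thread itself (a *.vcb line is "id word frequency"; a *.classes line is "word class-id", where the number names a word class, not a count); the function names and sample data are illustrative, not Moses code.]

```python
def parse_vcb(lines):
    """Map numeric word ID -> (word, corpus frequency), one entry per *.vcb line."""
    vocab = {}
    for line in lines:
        word_id, word, freq = line.split()
        vocab[int(word_id)] = (word, int(freq))
    return vocab

def parse_classes(lines):
    """Map word -> word-class ID. The second column is a class label,
    NOT a frequency, which is why producing it is expensive."""
    return {word: int(cls) for word, cls in (line.split() for line in lines)}

# Illustrative data only:
vocab = parse_vcb(["1 UNK 0", "2 the 51234", "3 house 721"])
classes = parse_classes(["the 17", "house 42"])
```

Parsing either file is trivial and fast; the cost of the *.classes file is in computing the class assignments in the first place, not in writing them out.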
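[Editor's note: why mkcls is slow. It does not count anything; it searches for a clustering of the vocabulary that maximises the likelihood of a class-bigram model, using an exchange algorithm in the spirit of the Och (1999) paper linked above. A naive, unoptimised sketch of that idea (real mkcls uses incremental count updates and further refinements, so this is illustrative only):]

```python
import math
from collections import Counter

def objective(corpus, cls):
    """Class-bigram log-likelihood criterion, up to a constant:
    sum N(c1,c2) log N(c1,c2) - 2 * sum N(c) log N(c)."""
    bi = Counter((cls[a], cls[b]) for a, b in zip(corpus, corpus[1:]))
    uni = Counter(cls[w] for w in corpus)
    return (sum(n * math.log(n) for n in bi.values())
            - 2 * sum(n * math.log(n) for n in uni.values()))

def exchange_cluster(corpus, num_classes=2, sweeps=5):
    """Greedy exchange: move each word to whichever class improves the
    objective most, and sweep until nothing moves."""
    vocab = sorted(set(corpus))
    cls = {w: i % num_classes for i, w in enumerate(vocab)}  # arbitrary init
    for _ in range(sweeps):
        moved = False
        for w in vocab:
            old = cls[w]
            best_c, best_f = old, None
            for c in range(num_classes):
                cls[w] = c
                f = objective(corpus, cls)  # naive full recomputation
                if best_f is None or f > best_f:
                    best_f, best_c = f, c
            cls[w] = best_c
            if best_c != old:
                moved = True
        if not moved:
            break
    return cls
```

Even this toy version is O(vocabulary x classes x corpus) per sweep when the objective is recomputed from scratch, which is why clustering dominates step 1's running time while dictionary extraction takes seconds.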
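[Editor's note: for comparison with the five EM iterations of IBM Model 1 mentioned in the thread, here is a minimal Model 1 EM sketch. This is not James's optimised code nor GIZA++; it is a textbook implementation over a list of tokenised sentence pairs, with a NULL token on the target side.]

```python
from collections import defaultdict

def train_model1(bitext, iterations=5):
    """Estimate translation probabilities t(f|e) with EM.
    bitext: list of (source_tokens, target_tokens) pairs."""
    # Uniform initialisation over co-occurring word pairs.
    t = defaultdict(float)
    f_vocab = {f for fs, _ in bitext for f in fs}
    init = 1.0 / len(f_vocab)
    for fs, es in bitext:
        for f in fs:
            for e in es + ["NULL"]:
                t[(f, e)] = init
    for _ in range(iterations):
        count = defaultdict(float)  # expected counts c(f, e)
        total = defaultdict(float)  # expected counts c(e)
        for fs, es in bitext:
            es_null = es + ["NULL"]
            for f in fs:
                z = sum(t[(f, e)] for e in es_null)  # normalisation
                for e in es_null:                    # E-step
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        for (f, e) in count:                         # M-step
            t[(f, e)] = count[(f, e)] / total[e]
    return t
```

On a toy French-English bitext the expected alignments emerge within a few iterations, e.g. t("maison"|"house") overtakes t("maison"|"the").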
