Hi,

yes, you are missing this here:
"An Efficient Method for Determining Bilingual Word Classes"
http://www.aclweb.org/anthology/E99-1010

The numbers in the classes files are not counts; they are word classes.
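To see why this is expensive compared to simple counting: mkcls performs an exchange-style clustering (as in the paper above), repeatedly moving each word to whichever class most improves a class-bigram likelihood. A rough toy sketch of the idea (my own illustrative code, not mkcls itself; this naive version recomputes the objective from scratch for every candidate move, which the real tool avoids with incremental count updates):

```python
# Toy sketch of exchange-style word clustering (in the spirit of mkcls,
# Och 1999); NOT the actual mkcls code.  Each sweep greedily moves every
# word to the class that maximizes a class-bigram log-likelihood.
import math
from collections import Counter

def class_bigram_ll(corpus, assign):
    """Log-likelihood of the corpus under p(w | c(w)) * p(c(w) | c(prev))."""
    word_cnt, cls_cnt, hist_cnt, bi_cnt = Counter(), Counter(), Counter(), Counter()
    for sent in corpus:
        prev = None
        for w in sent:
            c = assign[w]
            word_cnt[w] += 1
            cls_cnt[c] += 1
            if prev is not None:
                hist_cnt[prev] += 1
                bi_cnt[(prev, c)] += 1
            prev = c
    ll = 0.0
    for (c1, c2), n in bi_cnt.items():    # class transition term
        ll += n * math.log(n / hist_cnt[c1])
    for w, n in word_cnt.items():         # word emission term
        ll += n * math.log(n / cls_cnt[assign[w]])
    return ll

def exchange_cluster(corpus, num_classes, iterations=3):
    vocab = sorted({w for sent in corpus for w in sent})
    # round-robin initial assignment
    assign = {w: i % num_classes for i, w in enumerate(vocab)}
    for _ in range(iterations):
        for w in vocab:                   # try every class for every word
            scores = []
            for c in range(num_classes):
                assign[w] = c
                scores.append((class_bigram_ll(corpus, assign), c))
            assign[w] = max(scores)[1]    # keep the best move
    return assign
```

With V vocabulary words and K classes, every sweep needs on the order of V*K likelihood evaluations (here each one a full corpus pass), against a single pass for mere counting. That asymmetry is, as I understand it, the gap between 20 seconds and 70 minutes.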

-phi

On Tue, Sep 1, 2009 at 4:24 PM, James Read<[email protected]> wrote:
> Hi,
>
> Quoting Philipp Koehn <[email protected]>:
>
>> Hi,
>>
>> the computationally expensive part is the *.classes files.
>>
>
> Again, these files just have two columns: a string in the first column and
> what looks like a count in the second. Why should counting strings take so
> long? Am I missing something here?
>
> James
>
>> -phi
>>
>> On Tue, Sep 1, 2009 at 3:59 PM, James Read<[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> Quoting Philipp Koehn <[email protected]>:
>>>
>>>> Hi,
>>>>
>>>> yes, it is correct that step 1 is doing just the data preparation for
>>>> GIZA++.
>>>> The most time-consuming step is running mkcls to create the classes
>>>> for the relative distortion models.
>>>>
>>>
>>> Do you mean the *.vcb files that are created in Step 1? These just look
>>> like dictionary files with three fields: a) a numeric ID, b) the word
>>> entry, c) the frequency of the string. My make_dictionary function does
>>> this in about 20 seconds. Why is mkcls taking so long? Is it doing
>>> something complicated that I have missed here?
>>>
>>> James
>>>
>>>> -phi
>>>>
>>>> On Mon, Aug 31, 2009 at 4:39 PM, James Read<[email protected]>
>>>> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> does anyone know what step 1 of the Moses training script does other
>>>>> than produce the dictionaries and the numerical sentences that enable
>>>>> GIZA++ to do its job? The reason I ask is that on my machine step 1
>>>>> takes just over 70 minutes for the en-fr Europarl corpus.
>>>>>
>>>>> My optimised version of data preparation and EM IBM Model 1 completes
>>>>> in 121 seconds for five iterations of EM; that's just over 2 minutes.
>>>>> Before publishing these results I just wanted to make sure there's
>>>>> nothing I've missed about step 1 of the training process. Does it do
>>>>> anything at all that influences GIZA++ other than preparing the
>>>>> digital sentences?
>>>>>
>>>>> James
>>>>>
>>>>> --
>>>>> The University of Edinburgh is a charitable body, registered in
>>>>> Scotland, with registration number SC005336.
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Moses-support mailing list
>>>>> [email protected]
>>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>
>
>
