Many thanks. Just what I needed.

James

Quoting Philipp Koehn <[email protected]>:

> Hi,
>
> yes, you are missing this here:
> "An Efficient Method for Determining Bilingual Word Classes"
> http://www.aclweb.org/anthology/E99-1010
>
> The numbers in the classes files are not counts; they are word classes.
>
> -phi
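
[Editor's note: a toy illustration of the distinction Philipp draws above. The words, class IDs, and tab-separated layout here are assumptions for illustration, not real mkcls output:]

```python
# Hypothetical sketch of a *.classes layout: "word<TAB>class-id".
# The second column is a cluster label, not a frequency count.
# mkcls is slow because it searches over class assignments to optimize
# a class-bigram likelihood criterion (Och 1999), not because it counts.
lines = ["the\t12", "house\t37", "building\t37"]  # made-up words and IDs

word2class = {}
for line in lines:
    word, cls = line.split("\t")
    word2class[word] = int(cls)

# "house" and "building" land in the same class, so the relative
# distortion models can share statistics across them.
```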
>
> On Tue, Sep 1, 2009 at 4:24 PM, James Read<[email protected]> wrote:
>> Hi,
>>
>> Quoting Philipp Koehn <[email protected]>:
>>
>>> Hi,
>>>
>>> the computationally expensive part is the *.classes files.
>>>
>>
>> Again, these files just have two columns: a string in the first column and
>> what looks like a count in the second. Why should counting strings
>> take so long? Am I missing something here?
>>
>> James
>>
>>> -phi
>>>
>>> On Tue, Sep 1, 2009 at 3:59 PM, James Read<[email protected]> wrote:
>>>>
>>>> Hi,
>>>>
>>>> Quoting Philipp Koehn <[email protected]>:
>>>>
>>>>> Hi,
>>>>>
>>>>> yes, it is correct that step 1 is doing just the data preparation for
>>>>> GIZA++.
>>>>> The most time-consuming step is running mkcls to create the classes
>>>>> for the relative distortion models.
>>>>>
>>>>
>>>> Do you mean the *.vcb files that are created in Step 1? These just look
>>>> like dictionary files with 3 fields: a) a numeric ID, b) the word entry,
>>>> c) the frequency of the string. My make_dictionary function does this in
>>>> about 20 seconds. Why is mkcls taking so long? Is it doing something
>>>> complicated that I have missed here?
>>>>
>>>> James
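
[Editor's note: to make the three-field layout James describes concrete, here is a toy parse of it. The IDs and frequencies below are invented, and the whitespace-separated layout is an assumption:]

```python
# Assumed *.vcb layout: one line per word, carrying a numeric ID,
# the surface form, and its corpus frequency (illustrative values only).
vcb_lines = ["2 the 154211", "3 house 1276"]

vocab = {}
for line in vcb_lines:
    wid, word, freq = line.split()
    vocab[word] = (int(wid), int(freq))

# Building such a table is a single counting pass over the corpus,
# which is why it only takes seconds -- unlike mkcls's clustering search.
```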
>>>>
>>>>> -phi
>>>>>
>>>>> On Mon, Aug 31, 2009 at 4:39 PM, James Read<[email protected]>
>>>>> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Does anyone know what step 1 of the Moses training script does, other
>>>>>> than producing the dictionaries and the numerical sentences that enable
>>>>>> GIZA++ to do its job? The reason I ask is that on my machine step 1
>>>>>> takes just over 70 mins for the en-fr Europarl corpus.
>>>>>>
>>>>>> My optimised version of data preparation and EM IBM Model 1 completes
>>>>>> in 121 seconds for five iterations of EM, that's just over 2 minutes.
>>>>>> Before publishing these results I just wanted to make sure there's
>>>>>> nothing I've missed about step 1 of the training process. Does it do
>>>>>> anything at all that influences GIZA++ other than preparing the
>>>>>> digital sentences?
>>>>>>
>>>>>> James
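
[Editor's note: for readers following along, the EM training James refers to can be sketched as the textbook IBM Model 1 loop below. This is a toy illustration with an invented two-sentence corpus, not Moses or GIZA++ code:]

```python
from collections import defaultdict

# Toy IBM Model 1 EM sketch (invented corpus, uniform initialization).
corpus = [(["das", "haus"], ["the", "house"]),
          (["das", "buch"], ["the", "book"])]

f_vocab = {f for f_sent, _ in corpus for f in f_sent}
t = defaultdict(lambda: 1.0 / len(f_vocab))  # t(f | e), uniform start

for _ in range(5):  # five EM iterations, matching the timing above
    count = defaultdict(float)   # expected counts c(f, e)
    total = defaultdict(float)   # per-English-word normalizers
    for f_sent, e_sent in corpus:
        for f in f_sent:
            z = sum(t[(f, e)] for e in e_sent)  # sum over alignments of f
            for e in e_sent:
                c = t[(f, e)] / z
                count[(f, e)] += c
                total[e] += c
    for (f, e), c in count.items():  # M-step: re-estimate t(f | e)
        t[(f, e)] = c / total[e]

# After a few iterations "haus" gravitates to "house" and "das" to "the".
```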
>>>>>>
>>>>>> --
>>>>>> The University of Edinburgh is a charitable body, registered in
>>>>>> Scotland, with registration number SC005336.
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Moses-support mailing list
>>>>>> [email protected]
>>>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>>
>>
>>
>
>






