Hi. With plain GIZA the training runs fine. You're right, this is a small
corpus, so it does not make much of a difference, but eventually I am going
to use big corpora, so I have to learn those little tweaks for MGIZA :-) For
now I am about to start the tuning, following the
http://www.statmt.org/wmt07/baseline.html tutorial.

1. I do not understand why we use different corpora for tuning (dev2006.fr,
dev2006.en) than the original training corpora (Europarl). As far as I
understand, the input corpus (dev2006.fr) is what I will be translating when
I run the system on the development test set.

     1.1 Why do I need it for tuning? (dev2006.fr is, in theory, a document
I do not know yet; it will be the document to be translated.)

     1.2 The same for dev2006.en: why do I need it for tuning? I already
have the training results.

     1.3 Is the purpose of these corpora to improve the training results? If
so, why not run them through training?
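
To make question 1 concrete, this is the tuning command I think the tutorial
is telling me to run (the paths and the --working-dir name below are my own
guesses, so please correct me if this is wrong):

    # rough MERT tuning call -- paths are placeholders from my setup
    # arguments: dev source, dev reference, decoder binary, decoder config
    $SCRIPTS_ROOTDIR/training/mert-moses.pl \
        dev2006.fr dev2006.en \
        ~/moses/bin/moses model/moses.ini \
        --working-dir=tuning

So I can see that dev2006.fr and dev2006.en are the inputs to this step; what
I do not understand is why they are needed on top of the training data.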

2. Is tuning memory-bound or CPU-bound? Will increasing my memory decrease
tuning time? I read that I can tweak the tuning (phrase table model, LM,
distortion model, word penalty) for faster results, but that the translation
quality will not be as good as with the default settings. Did I understand
that correctly?
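
Concretely, the tweak I had in mind is passing pruning options through to the
decoder during tuning, roughly like this (the flag names and values are only
my reading of the documentation, so they may well be off):

    # extra arguments appended to the mert-moses.pl call above
    # tighter limits should decode faster but search less, so lower quality
    --decoder-flags="-s 100 -ttable-limit 10"

My understanding is that these just make the decoder prune harder, and that
is where the quality loss would come from.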

3. Where are dev2006.fr, dev2006.en, and dev2006.es?


thanks a lot
Roberto





On Thu, Jan 27, 2011 at 11:38 AM, Hieu Hoang <[email protected]> wrote:

> Just a personal preference - you should try using plain GIZA before using
> MGIZA. MGIZA may have peculiarities that people may not be aware of. You
> don't need it anyway if you have a small corpus.
>
> On 27 January 2011 23:22, Roberto Rios <[email protected]> wrote:
>
>> Thank you for your previous answer; it was a lot of help. I am new at
>> this and trying to learn how to use this tool. I am able to run MGIZA. I
>> am using a small corpus for testing now (news-commentary en-es). I get
>> this error after Model 4, iteration 6:
>>
>> ERROR: Giza did not produce the output file
>> /home/roberto/demo/tools/working-dir/giza.es-en/es-en.A3.final. Is your
>> corpus clean (reasonably-sized sentences)? at ./train-model.perl line 1013.
>>
>> 1. The corpora were tokenized, cleaned, and lowercased. What could be
>> wrong?
>>
>> 2. Why do I have to run so many iterations per model (m1=5, hmm=5,
>> H333444: Viterbi training), and what is the difference between the
>> models? Where can I find good information about this?
>>
>> thanks
>> Roberto
>>
>
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
