Hi,

GIZA++ has a limit of 100 words per sentence.
It usually makes little sense to include sentences longer
than 60 words in training, since their word alignment is
difficult to compute.
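
As a rough illustration of the kind of length filtering discussed in this
thread, the Python sketch below keeps only sentence pairs in which both sides
have between 1 and 60 tokens. It is not the clean-corpus-n.perl script: the
names keep_pair and filter_corpus and the max_ratio cap of 9 are assumptions
made for this example, and tokens are simply whitespace-separated words.

```python
def keep_pair(src, tgt, min_len=1, max_len=60, max_ratio=9.0):
    """Return True if a sentence pair looks safe to pass to word alignment.

    max_ratio is an illustrative value chosen for this sketch.
    """
    n_src, n_tgt = len(src.split()), len(tgt.split())
    # Empty lines on either side are always rejected.
    if n_src == 0 or n_tgt == 0:
        return False
    # Reject sentences that are too short or too long on either side.
    if not (min_len <= n_src <= max_len and min_len <= n_tgt <= max_len):
        return False
    # Reject pairs whose lengths differ by too large a factor.
    return max(n_src, n_tgt) / min(n_src, n_tgt) <= max_ratio


def filter_corpus(src_lines, tgt_lines, **limits):
    """Yield only the (source, target) pairs that pass keep_pair."""
    for src, tgt in zip(src_lines, tgt_lines):
        src, tgt = src.strip(), tgt.strip()
        if keep_pair(src, tgt, **limits):
            yield src, tgt
```

Passing two open file handles, one per language, to filter_corpus yields the
surviving pairs in their original order, so both sides stay aligned line by
line.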

-phi

On Wed, Oct 7, 2009 at 6:37 PM, Danish Contractor
<[email protected]> wrote:
> Hi,
>
> Thanks for the reply. Yes, I did run the clean-corpus-n.perl script.
> I also had to replace all occurrences of "|" in the Hindi text with another
> character, as it seems "|" is of special significance to the scripts.
>
> The "|" is used in the Hindi language as a full stop (".", the end of
> sentence marker).
>
> Could you please let me know if there is a limit on the max length of
> sentences - I gave a length of 1 - 60 while running the script.
> In addition, is there any limit on the max allowable difference in sentence
> length of the parallel text?
>
> Thanks.
> --Danish
>
> On Wed, Oct 7, 2009 at 6:41 PM, Philipp Koehn <[email protected]> wrote:
>>
>> Hi,
>>
>> The problem lies in the word alignment step (step 3). You can run that
>> step in isolation to check in more detail what is going wrong.
>>
>> One common problem with word alignment is that GIZA++ is sensitive
>> to bad data, i.e. empty lines, long sentences, or excessive mismatch
>> in sentence length. The clean-corpus-n.perl script is designed to take
>> care of these problems. Did you run this on your original corpus?
>>
>> -phi
>>
>> On Sun, Oct 4, 2009 at 6:32 AM, Danish Contractor
>> <[email protected]> wrote:
>> > Hi,
>> >
>> > I have compiled Moses, GIZA++ and SRILM on Fedora Core 11 using the steps
>> > described in http://www.statmt.org/moses_steps.html and other Moses
>> > support links.
>> >
>> > While training on my parallel corpus of English and Hindi (~100,000
>> > sentences each), I get the error shown below when I execute:
>> >
>> > nohup nice ./tools/moses-scripts/scripts-20091002-0031//training/train-factored-phrase-model.perl \
>> >   -scripts-root-dir ./tools/moses-scripts/scripts-20091002-0031/ -root-dir work3 \
>> >   -corpus ./work3/corpus/IRL-clean -f hi2 -e en2 \
>> >   -alignment grow-diag-final-and -reordering msd-bidirectional-fe \
>> >   -lm 0:3:/home/danish/FIRE2010/work3/lm/IRL-en.lm >& ./work3/training.out &
>> >
>> > In one step of the training process, I get the following error and the
>> > tool quits:
>> >
>> > Last few lines of output (training.out):
>> >
>> > Use of uninitialized value $a in split at ./tools/moses-scripts/scripts-20091002-0031/training/train-factored-phrase-model.perl line 856.
>> > Use of uninitialized value $a in scalar chomp at ./tools/moses-scripts/scripts-20091002-0031/training/train-factored-phrase-model.perl line 853.
>> > Use of uninitialized value $a in split at ./tools/moses-scripts/scripts-20091002-0031/training/train-factored-phrase-model.perl line 856.
>> > Use of uninitialized value $a in scalar chomp at ./tools/moses-scripts/scripts-20091002-0031/training/train-factored-phrase-model.perl line 853.
>> > Use of uninitialized value $a in split at ./tools/moses-scripts/scripts-20091002-0031/training/train-factored-phrase-model.perl line 856.
>> > Use of uninitialized value $a in scalar chomp at ./tools/moses-scripts/scripts-20091002-0031/training/train-factored-phrase-model.perl line 853.
>> > Use of uninitialized value $a in split at ./tools/moses-scripts/scripts-20091002-0031/training/train-factored-phrase-model.perl line 856.
>> > Use of uninitialized value $a in scalar chomp at ./tools/moses-scripts/scripts-20091002-0031/training/train-factored-phrase-model.perl line 853.
>> > Use of uninitialized value $a in split at ./tools/moses-scripts/scripts-20091002-0031/training/train-factored-phrase-model.perl line 856.
>> > Use of uninitialized value $a in scalar chomp at ./tools/moses-scripts/scripts-20091002-0031/training/train-factored-phrase-model.perl line 853.
>> > Use of uninitialized value $a in split at ./tools/moses-scripts/scripts-20091002-0031/training/train-factored-phrase-model.perl line 856.
>> >
>> > Saved: ./work3//model/lex.f2e and ./work3//model/lex.e2f
>> > FILE: ./work3/corpus/IRL-clean.en2
>> > FILE: ./work3/corpus/IRL-clean.hi2
>> > FILE: ./work3//model/aligned.grow-diag-final-and
>> > (5) extract phrases @ Sat Oct  3 02:46:00 IST 2009
>> >
>> > ./tools/moses-scripts//scripts-20091002-0031//training/phrase-extract/extract
>> > ./work3/corpus/IRL-clean.en2 ./work3/corpus/IRL-clean.hi2
>> > ./work3//model/aligned.grow-diag-final-and ./work3//model/extract 7
>> > --NoFileLimit orientation
>> > Executing:
>> >
>> > ./tools/moses-scripts//scripts-20091002-0031//training/phrase-extract/extract
>> > ./work3/corpus/IRL-clean.en2 ./work3/corpus/IRL-clean.hi2
>> > ./work3//model/aligned.grow-diag-final-and ./work3//model/extract 7
>> > --NoFileLimit orientation
>> > PhraseExtract v1.4, written by Philipp Koehn
>> > phrase extraction from an aligned parallel corpus
>> > .........Executing: gzip ./work3//model/extract.inv
>> > gzip: ./work3//model/extract.inv: No such file or directory
>> > Exit code: 1
>> > ERROR at ./tools/moses-scripts/scripts-20091002-0031/training/train-factored-phrase-model.perl line 963.
>> >
>> >
>> > My clean sentence files have the extensions hi2 (for Hindi) and en2
>> > (for English).
>> > I have tried the solutions available on the Moses support forums for similar
>> > problems, but they have not helped.
>> >
>> > The following is a listing of the files and folders in my work folder
>> > (work3):
>> >
>> > corpus folder
>> > total 76384
>> > -rw-rw-r--. 1 danish danish 27717737 2009-10-02 23:29 IRL-clean.hi2
>> > -rw-rw-r--. 1 danish danish 11502887 2009-10-02 23:29 IRL-clean.en2
>> > -rw-r--r--. 1 root   root    1781671 2009-10-03 17:44 hi2.vcb.classes
>> > -rw-r--r--. 1 root   root    1579583 2009-10-03 17:44 hi2.vcb.classes.cats
>> > -rw-r--r--. 1 root   root     704087 2009-10-03 17:50 en2.vcb.classes
>> > -rw-r--r--. 1 root   root     534277 2009-10-03 17:50 en2.vcb.classes.cats
>> > -rw-r--r--. 1 root   root    2158362 2009-10-03 17:50 hi2.vcb
>> > -rw-r--r--. 1 root   root    1013926 2009-10-03 17:50 en2.vcb
>> > -rw-r--r--. 1 root   root   15605740 2009-10-03 17:50 hi2-en2-int-train.snt
>> > -rw-r--r--. 1 root   root   15605740 2009-10-03 17:51 en2-hi2-int-train.snt
>> >
>> > giza.en2-hi2 folder
>> > total 124088
>> > -rw-r--r--. 1 root root 109989326 2009-10-03 18:44 en2-hi2.cooc
>> > -rw-r--r--. 1 root root      1651 2009-10-03 18:44 en2-hi2.gizacfg
>> > -rw-r--r--. 1 root root  17070807 2009-10-03 19:22 en2-hi2.A3.final.gz
>> >
>> > giza.hi2-en2 folder
>> > total 124052
>> > -rw-r--r--. 1 root root 110088686 2009-10-03 17:51 hi2-en2.cooc
>> > -rw-r--r--. 1 root root      1651 2009-10-03 17:51 hi2-en2.gizacfg
>> > -rw-r--r--. 1 root root  16928263 2009-10-03 18:43 hi2-en2.A3.final.gz
>> >
>> > lm folder
>> > total 100388
>> > -rw-rw-r--. 1 danish danish 27717737 2009-10-02 23:29 IRL-clean.hi2
>> > -rw-rw-r--. 1 danish danish 11502887 2009-10-02 23:29 IRL-clean.en2
>> > -rw-r--r--. 1 root   root   22834140 2009-10-03 17:29 IRL-en.lm
>> > -rw-r--r--. 1 root   root   40731568 2009-10-03 17:30 IRL-hi.lm
>> >
>> > model folder
>> > total 7992
>> > -rw-r--r--. 1 root root       0 2009-10-03 19:23 aligned.grow-diag-final-and
>> > -rw-r--r--. 1 root root 4089006 2009-10-03 19:23 lex.f2e
>> > -rw-r--r--. 1 root root 4089006 2009-10-03 19:23 lex.e2f
>> >
>> > I can see that the model folder does not contain the extract.inv file, which
>> > seems to cause the error. I have redone the steps three times and get exactly
>> > the same error each time.
>> >
>> > I have ensured that the parallel text has been lowercased (for English)
>> > and cleaned (both English and Hindi).
>> > May I request you to kindly help me resolve this issue at the earliest?
>> > Thanks!
>> >
>> > Thank you,
>> > Regards,
>> >
>> > Danish Contractor
>> >
>> >
>> >
>> >
>> >
>> > _______________________________________________
>> > Moses-support mailing list
>> > [email protected]
>> > http://mailman.mit.edu/mailman/listinfo/moses-support
>> >
>> >
>
>
