Re: [Moses-support] My phrase-table.tgz is 20-bytes long

Tom Hoar Wed, 25 Feb 2015 02:41:05 -0800

Alexander,

If your MGIZA word alignment .gz files are empty, the error is happening 
in step 2. Errors there aren't trapped and the system continues running. 
Therefore, the outputs of steps 3 (alignment file), 4 (lex files) & 5 
(extract files) are all garbage. If the word alignment files are OK but
the extract files are missing or empty, you probably ran out of hard
drive space, as Barry suggested.
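
A quick way to see which step broke is to list every compressed output that decompresses to nothing. The following is only a minimal sketch, not part of Moses: the function name find_empty_gz is made up for illustration, and it assumes the outputs are ordinary gzip files ending in .gz or .tgz somewhere under the training root.

```python
import gzip
import zlib
from pathlib import Path


def find_empty_gz(directory):
    """Return relative paths of .gz/.tgz files under `directory` that hold no data."""
    empty = []
    for path in sorted(Path(directory).rglob("*")):
        if not path.is_file() or path.suffix not in {".gz", ".tgz"}:
            continue
        try:
            with gzip.open(path, "rb") as handle:
                # One decompressed byte is enough to tell "empty" from "has data".
                has_data = bool(handle.read(1))
        except (OSError, EOFError, zlib.error):
            # Truncated or corrupt archive: report it as unusable as well.
            has_data = False
        if not has_data:
            empty.append(str(path.relative_to(directory)))
    return empty
```

Calling find_empty_gz("train") on the -root-dir from the command in this thread would list the empty files. If the MGIZA alignment files show up in the list, look at step 2; if only the extract files and phrase table do, look at disk space.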


Running for 10 days on a 40-core configuration is a lot to manage. It 
sounds like a large corpus. Have you run a successful training session 
on a sample subset of your data? I would suggest extracting a random
sample of ~15,000 pairs and running your configuration with -mgiza-cpus 8
and -cores 8. It should take about 30 minutes to run and you shouldn't have
any disk space problems. Work out any bugs in your corpus prep and/or 
runtime with this smaller subset. Then, scale up to your full-sized 
corpus. With large corpora that run 10 days, you might need several 
hundred gigabytes of available space for temp files in your final output 
folder, i.e. not /tmp.
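
For the random sample, the important part is that the same line numbers are taken from both sides so the pairs stay aligned. Here is a rough sketch of one way to do it, assuming one sentence per line and equal line counts in both files; the function name sample_parallel and the output file names are hypothetical, not anything that ships with Moses.

```python
import random


def sample_parallel(src_path, tgt_path, out_src, out_tgt, k=15000, seed=0):
    """Copy the same k randomly chosen line pairs from two aligned files.

    Returns the number of pairs written.
    """
    # First pass: count lines on the source side (assumed equal on the target side).
    with open(src_path, "rb") as handle:
        total = sum(1 for _ in handle)

    # Pick k distinct line indices; a fixed seed makes the sample reproducible.
    rng = random.Random(seed)
    keep = set(rng.sample(range(total), min(k, total)))

    written = 0
    # Binary mode copies lines byte for byte, so the encoding is left untouched.
    with open(src_path, "rb") as src, open(tgt_path, "rb") as tgt, \
            open(out_src, "wb") as o_src, open(out_tgt, "wb") as o_tgt:
        for index, (s_line, t_line) in enumerate(zip(src, tgt)):
            if index in keep:
                o_src.write(s_line)
                o_tgt.write(t_line)
                written += 1
    return written
```

For example, sample_parallel("ru-en.clean.ru", "ru-en.clean.en", "ru-en.sample.ru", "ru-en.sample.en") would produce a pair of files that can be passed to train-model.perl as -corpus ru-en.sample with the same -f ru -e en options.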



On 02/25/2015 05:19 PM, Barry Haddow wrote:
> Hi Alexander,
>
> It looks like something went wrong at the extract stage. If you could
> make your training.out available then we can look for clues.
>
> Could the system have run out of disk space, either in the working
> directory or in /tmp? A lot of space is required to build the extract
> files and phrase tables.
>
> cheers - Barry
>
> On 25/02/15 05:32, Александр Паньшин wrote:
>> OK, I've started from scratch. I'm pretty sure I prepared the
>> corpus this way:
>>
>> 1. Tokenized the initial corpora with tokenizer.perl and noted the
>> numbers of the lines that caused errors or warnings.
>> 2. Deleted those lines from both files using sed.
>> 3. Tokenized the files again. No errors.
>> 4. Created a truecase model and truecased the files.
>> 5. Removed overly long lines with clean-corpus-n.perl 1 50.
>>
>> Then I started training the translation model with:
>>
>>   nohup nice /opt/moses/scripts/training/train-model.perl --parallel
>> -mgiza -mgiza-cpus 40 -cores 40 -root-dir train -corpus
>> ~/corpus/ru-en.clean -f ru -e en -alignment grow-diag-final-and
>> -reordering msd-bidirectional-fe -lm 0:3:$HOME/lm/ru-en.arpa.en:8
>> -external-bin-dir /opt/moses/mgiza >& training.out &
>>
>> After ten days of waiting I have a 20-byte phrase-table.tgz again!
>> What am I doing wrong?
>>
>> I have both ru-en and en-ru A3.final.gz files,
>> aligned.grow-diag-final-and, lex.e2f and lex.f2e, all of quite good
>> size, but the phrase table, extract.*.sorted.gz and the reordering
>> table are empty.
>>
>> I still have no idea what goes wrong, or why :(
>>
>> 2015-02-14 21:54 GMT+07:00 Kenneth Heafield <[email protected]>:
>>
>>      Sign my petition to add return code checking to train-model.perl.
>>
>>      On 02/14/2015 09:33 AM, Tom Hoar wrote:
>>      > An empty phrase-table.gz file is usually the result of an
>>      > ill-prepared training corpus. Make sure you run the final corpus
>>      > through clean-corpus-n.perl.
>>      >
>>      >
>>      >
>>      > On 02/14/2015 09:19 PM, Александр Паньшин wrote:
>>      >> Hello, everybody!
>>      >>
>>      >> I have a problem with Moses. I created a big parallel corpus by
>>      >> concatenating a bunch of existing corpora from
>>      >> http://opus.lingfil.uu.se. After that I cleaned up the result (the
>>      >> tokenizer script reported some errors, so I deleted the offending
>>      >> lines from both sides).
>>      >>
>>      >> Then I started training the translation model using MGIZA with
>>      >> this command:
>>      >>
>>      >> nohup nice /opt/moses/scripts/training/train-model.perl --parallel
>>      >> -mgiza -mgiza-cpus 20 -cores 20 -root-dir train -corpus
>>      >> ~/corpus/ru-en.clean -f ru -e en -alignment grow-diag-final-and
>>      >> -reordering msd-bidirectional-fe -lm 0:3:$HOME/lm/ru-en.arpa.en:8
>>      >> -external-bin-dir /opt/moses/mgiza >& training.out &
>>      >>
>>      >> After a week of work I have this at the end of training.out:
>>      >> (7) learn reordering model @ Sun Feb  8 15:30:35 MSK 2015
>>      >> (7.1) [no factors] learn reordering model @ Sun Feb  8 15:30:35 MSK 2015
>>      >> (7.2) building tables @ Sun Feb  8 15:30:35 MSK 2015
>>      >> Executing: /opt/moses/scripts/../bin/lexical-reordering-score
>>      >> /home/adminadmin/working/train/model/extract.o.sorted.gz 0.5
>>      >> /home/adminadmin/working/train/model/reordering-table. --model "wbe
>>      >> msd wbe-msd-bidirectional-fe"
>>      >> Lexical Reordering Scorer
>>      >> scores lexical reordering models of several types (hierarchical,
>>      >> phrase-based and word-based-extraction
>>      >> (8) learn generation model @ Sun Feb  8 15:30:35 MSK 2015
>>      >>   no generation model requested, skipping step
>>      >> (9) create moses.ini @ Sun Feb  8 15:30:35 MSK 2015
>>      >>
>>      >> There are a bunch of files in the ~/working/train folder. It looks
>>      >> like everything is OK, except for one tiny problem:
>>      >> phrase-table.tgz has a size of 20 bytes. And, of course, it's not
>>      >> usable at all!
>>      >>
>>      >> Can somebody help and give me a direction where to dig?
>>      >>
>>      >>
>>      >> _______________________________________________
>>      >> Moses-support mailing list
>>      >> [email protected]
>>      >> http://mailman.mit.edu/mailman/listinfo/moses-support
>

