There are 23.5 million lines in the cleaned-up corpus.

Thanks for the advice. I'll try this.


2015-02-25 17:37 GMT+07:00 Tom Hoar <[email protected]>:

> Alexander,
>
> If your MGIZA word alignment .gz files are empty, the error is happening
> in step 2. Errors there aren't trapped and the system continues running.
> Therefore, the outputs of steps 3 (alignment file), 4 (lex files) & 5
> (extract files) are all garbage. If the word alignment files are ok and
> the extract files are missing, you probably ran out of hard drive space,
> as Barry suggested.
>
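A minimal sketch of that check, assuming Python 3 is available. The helper name gz_is_empty is made up for illustration and is not part of Moses. It relies on the fact that gzip applied to empty input from a pipe produces exactly 20 bytes, so a 20-byte .gz file is most likely a valid archive with nothing in it.

```python
import gzip


def gz_is_empty(path):
    """Return True if the gzip file at `path` decompresses to zero bytes."""
    with gzip.open(path, "rb") as fh:
        # Reading a single byte is enough; no need to decompress everything.
        return fh.read(1) == b""
```

Running it over the A3.final.gz and extract.*.sorted.gz files should show which step first produced empty output.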
> Running for 10 days on a 40-core configuration is a lot to manage. It
> sounds like a large corpus. Have you run a successful training session
> on a sample subset of your data? I would suggest extracting a random
> sample of ~15,000 pairs and running your configuration with -mgiza-cpus 8
> and -cores 8. It should take about 30 minutes to run and you shouldn't have
> any disk space problems. Work out any bugs in your corpus prep and/or
> runtime with this smaller subset. Then, scale up to your full-sized
> corpus. With large corpora that run 10 days, you might need several
> hundred gigabytes of available space for temp files in your final output
> folder, i.e. not /tmp.
>
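The random-sample step could be sketched roughly as follows, assuming Python 3 and a corpus stored as two line-aligned UTF-8 files. The function name sample_parallel and the seed parameter are illustrative choices, not anything from Moses. The same line indices are kept on both sides so the pairs stay aligned.

```python
import random


def sample_parallel(src_in, tgt_in, src_out, tgt_out, n, seed=0):
    """Write the same n randomly chosen line pairs from both sides.

    Returns the number of pairs written (fewer than n for a small corpus).
    """
    # First pass: count lines so indices can be drawn without loading the corpus.
    with open(src_in, encoding="utf-8") as f:
        total = sum(1 for _ in f)
    keep = set(random.Random(seed).sample(range(total), min(n, total)))

    # Second pass: copy only the selected pairs, in their original order.
    with open(src_in, encoding="utf-8") as fs, \
         open(tgt_in, encoding="utf-8") as ft, \
         open(src_out, "w", encoding="utf-8") as out_s, \
         open(tgt_out, "w", encoding="utf-8") as out_t:
        for i, (s, t) in enumerate(zip(fs, ft)):
            if i in keep:
                out_s.write(s)
                out_t.write(t)
    return len(keep)
```

For the corpus in this thread the inputs would be ru-en.clean.ru and ru-en.clean.en, with n set to 15000; the output names are up to you.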
>
>
> On 02/25/2015 05:19 PM, Barry Haddow wrote:
> > Hi Alexander,
> >
> > It looks like something went wrong at the extract stage. If you could
> > make your training.out available then we can look for clues.
> >
> > Could the system have run out of disk space, either in the working
> > directory or in /tmp? A lot of space is required to build the extract
> > files and phrase tables.
> >
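A quick way to look at free space before a long run, sketched in Python 3 with the standard library only. The helper name free_gib is illustrative; how much space is actually needed depends on the corpus and is not something this snippet can tell you.

```python
import shutil
import tempfile


def free_gib(path):
    """Free space on the filesystem that holds `path`, in GiB."""
    return shutil.disk_usage(path).free / 2**30


# Check the current (working) directory and the system temp dir,
# which is usually /tmp on Linux.
for p in (".", tempfile.gettempdir()):
    print(f"{p}: {free_gib(p):.1f} GiB free")
```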
> > cheers - Barry
> >
> > On 25/02/15 05:32, Александр Паньшин wrote:
> >> Ok, I've started from scratch. I'm pretty sure that I prepared the
> >> corpus this way:
> >>
> >> 1. Tokenized the initial corpora with tokenizer.perl and noted the
> >> numbers of the lines that caused errors or warnings.
> >> 2. Deleted these lines from both files using sed.
> >> 3. Tokenized the files again. No errors.
> >> 5. Created a truecase model and truecased the files.
> >> 6. Deleted overly long lines using clean-corpus-n.perl 1 50.
> >>
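Step 2 above has to remove the same line numbers from both sides, otherwise the corpus goes out of alignment. A minimal Python 3 sketch of that step, as an alternative to sed; the function name drop_lines is hypothetical and the line numbers are 1-based, as sed and most error messages count them.

```python
def drop_lines(src_in, tgt_in, src_out, tgt_out, bad_line_numbers):
    """Copy both files, skipping the given 1-based line numbers in each.

    Returns the number of line pairs kept.
    """
    bad = set(bad_line_numbers)
    kept = 0
    with open(src_in, encoding="utf-8") as fs, \
         open(tgt_in, encoding="utf-8") as ft, \
         open(src_out, "w", encoding="utf-8") as out_s, \
         open(tgt_out, "w", encoding="utf-8") as out_t:
        for lineno, (s, t) in enumerate(zip(fs, ft), start=1):
            if lineno in bad:
                continue  # drop the pair from both sides together
            out_s.write(s)
            out_t.write(t)
            kept += 1
    return kept
```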
> >> Started translation model creation process by:
> >>
> >>   nohup nice /opt/moses/scripts/training/train-model.perl --parallel
> >> -mgiza -mgiza-cpus 40 -cores 40 -root-dir train -corpus
> >> ~/corpus/ru-en.clean -f ru -e en -alignment grow-diag-final-and
> >> -reordering msd-bidirectional-fe -lm 0:3:$HOME/lm/ru-en.arpa.en:8
> >> -external-bin-dir /opt/moses/mgiza >& training.out &
> >>
> >> After ten days of waiting I have a 20-byte phrase-table.tgz again!
> >> What am I doing wrong?
> >>
> >> I have both ru-en and en-ru A3.final.gz files,
> >> aligned-grow-diag-final.and, lex.e2f, lex.f2e of quite good size, but
> >> empty phrase-table, extract.*.sorted.gz and reordering table.
> >>
> >> I still have no idea what goes wrong or why :(
> >>
> >> 2015-02-14 21:54 GMT+07:00 Kenneth Heafield <[email protected]
> >> <mailto:[email protected]>>:
> >>
> >>      Sign my petition to add return code checking to train-model.perl.
> >>
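To illustrate what return code checking buys you, here is a hypothetical Python 3 wrapper that runs a list of commands and stops at the first failure. It is only a sketch of the idea; it is not how train-model.perl works, and run_steps is a made-up name.

```python
import subprocess


def run_steps(steps):
    """Run commands in order; raise on the first non-zero exit code.

    `steps` is a list of argument lists, e.g. [["gzip", "-t", "file.gz"]].
    Returns the number of steps that completed successfully.
    """
    done = 0
    for argv in steps:
        # check=True raises CalledProcessError instead of silently continuing,
        # so later steps never run on the output of a failed one.
        subprocess.run(argv, check=True)
        done += 1
    return done
```

With a check like this, a failure in an early step would surface right away rather than after days of later steps working on empty files.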
> >>      On 02/14/2015 09:33 AM, Tom Hoar wrote:
> >>      > An empty phrase-table.gz file is usually the result of an
> >>      > ill-prepared training corpus. Make sure you run the final corpus
> >>      > through clean-corpus-n.perl.
> >>      >
> >>      >
> >>      >
> >>      > On 02/14/2015 09:19 PM, Александр Паньшин wrote:
> >>      >> Hello, everybody!
> >>      >>
> >>      >> I have a problem with Moses. I created a big parallel corpus by
> >>      >> concatenating a bunch of existing corpora from
> >>      >> http://opus.lingfil.uu.se. After that I cleaned up the result
> >>      >> (the tokenizer script reported some errors, so I deleted the
> >>      >> error-prone rows from both sides).
> >>      >>
> >>      >> Then I started training the translation model using mgiza with
> >>      >> this command:
> >>      >>
> >>      >> nohup nice /opt/moses/scripts/training/train-model.perl --parallel
> >>      >> -mgiza -mgiza-cpus 20 -cores 20 -root-dir train -corpus
> >>      >> ~/corpus/ru-en.clean -f ru -e en -alignment grow-diag-final-and
> >>      >> -reordering msd-bidirectional-fe -lm 0:3:$HOME/lm/ru-en.arpa.en:8
> >>      >> -external-bin-dir /opt/moses/mgiza >& training.out &
> >>      >>
> >>      >> After a week of work I have this at the end of training.out:
> >>      >> (7) learn reordering model @ Sun Feb  8 15:30:35 MSK 2015
> >>      >> (7.1) [no factors] learn reordering model @ Sun Feb  8 15:30:35 MSK 2015
> >>      >> (7.2) building tables @ Sun Feb  8 15:30:35 MSK 2015
> >>      >> Executing: /opt/moses/scripts/../bin/lexical-reordering-score
> >>      >> /home/adminadmin/working/train/model/extract.o.sorted.gz 0.5
> >>      >> /home/adminadmin/working/train/model/reordering-table. --model
> >>      >> "wbe msd wbe-msd-bidirectional-fe"
> >>      >> Lexical Reordering Scorer
> >>      >> scores lexical reordering models of several types (hierarchical,
> >>      >> phrase-based and word-based-extraction
> >>      >> (8) learn generation model @ Sun Feb  8 15:30:35 MSK 2015
> >>      >>   no generation model requested, skipping step
> >>      >> (9) create moses.ini @ Sun Feb  8 15:30:35 MSK 2015
> >>      >>
> >>      >> There is a bunch of files in the ~/working/train folder. It looks
> >>      >> like everything is ok, except for one tiny problem: phrase-table.tgz
> >>      >> has a size of 20 bytes. And, of course, it's not usable at all!
> >>      >>
> >>      >> Can somebody help and give me a direction where to dig?
> >>      >>
> >>      >>
> >>      >> _______________________________________________
> >>      >> Moses-support mailing list
> >>      >> [email protected] <mailto:[email protected]>
> >>      >> http://mailman.mit.edu/mailman/listinfo/moses-support
