There are 23.5 million lines in the cleaned-up corpus. Thanks for the advice. I'll try this.
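For the test run Tom suggests, here is a minimal sketch of how I might draw the random sample of about 15,000 sentence pairs. It is only an illustration: the function names `sample_parallel` and `write_sample` are my own, and I am assuming the corpus is the usual pair of line-aligned plain-text files (one per language). It uses reservoir sampling so the 23.5 million lines never have to sit in memory at once.

```python
import random
from itertools import zip_longest


def sample_parallel(src_lines, tgt_lines, k, seed=0):
    """Reservoir-sample k aligned (source, target) pairs in a single pass.

    src_lines and tgt_lines are iterables of lines (open files work).
    Raises ValueError if the two sides have different line counts.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, (src, tgt) in enumerate(zip_longest(src_lines, tgt_lines)):
        if src is None or tgt is None:
            raise ValueError("source and target have different line counts")
        if i < k:
            # Fill the reservoir with the first k pairs.
            reservoir.append((src, tgt))
        else:
            # Replace an existing entry with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = (src, tgt)
    return reservoir


def write_sample(src_path, tgt_path, out_stem, src_ext, tgt_ext, k=15000, seed=0):
    """Write the sampled pairs to out_stem.src_ext and out_stem.tgt_ext."""
    with open(src_path, encoding="utf-8") as s, open(tgt_path, encoding="utf-8") as t:
        pairs = sample_parallel(s, t, k, seed)
    with open(f"{out_stem}.{src_ext}", "w", encoding="utf-8") as so, \
         open(f"{out_stem}.{tgt_ext}", "w", encoding="utf-8") as to:
        for src, tgt in pairs:
            so.write(src)
            to.write(tgt)
    return len(pairs)
```

Assuming the cleaned files are named ru-en.clean.ru and ru-en.clean.en, a call such as write_sample("ru-en.clean.ru", "ru-en.clean.en", "ru-en.sample", "ru", "en") would produce ru-en.sample.ru and ru-en.sample.en, and -corpus could then point at the ru-en.sample stem for the short run.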

2015-02-25 17:37 GMT+07:00 Tom Hoar <[email protected]>:

> Alexander,
>
> If your MGIZA word alignment .gz files are empty, the error is happening
> in step 2. Errors there aren't trapped and the system continues running.
> Therefore, the outputs of steps 3 (alignment file), 4 (lex files) & 5
> (extract files) are all garbage. If the word alignment files are ok and
> the extract files are missing, you probably ran out of hard drive space,
> as Barry suggested.
>
> Running for 10 days on a 40-core configuration is a lot to manage. It
> sounds like a large corpus. Have you run a successful training session
> on a sample subset of your data? I would suggest extracting a random
> sample of ~15,000 pairs and run your configuration with -mgiza-cpus 8 &
> -cores 8. It should take about 30 minutes to run and you shouldn't have
> any disk space problems. Work out any bugs in your corpus prep and/or
> runtime with this smaller subset. Then, scale up to your full-sized
> corpus. With large corpora that run 10 days, you might need several
> hundred gigabytes of available space for temp files in your final output
> folder, i.e. not /tmp.
>
> On 02/25/2015 05:19 PM, Barry Haddow wrote:
> > Hi Alexander,
> >
> > It looks like something went wrong at the extract stage. If you could
> > make your training.out available then we can look for clues.
> >
> > Could the system have run out of disk space, either in the working
> > directory or in /tmp? A lot of space is required to build the extract
> > files and phrase tables.
> >
> > cheers - Barry
> >
> > On 25/02/15 05:32, Александр Паньшин wrote:
> >> Ok, I've started from scratch. I'm pretty sure that I worked with
> >> corpus such a way:
> >>
> >> 1. I tokenized the initial corpuses with tokenizer.perl. Learned
> >>    numbers of lines caused any errors and warnings
> >> 2. Deleted these lines from both files using sed
> >> 3. Tokenized the files again. No errors
> >> 5. Created truecase-model and truecases the files.
> >> 6. Deleted too long lines by using clean-corpus-n.perl 1 50
> >>
> >> Started translation model creation process by:
> >>
> >> nohup nice /opt/moses/scripts/training/train-model.perl --parallel
> >> -mgiza -mgiza-cpus 40 -cores 40 -root-dir train -corpus
> >> ~/corpus/ru-en.clean -f ru -e en -alignment grow-diag-final-and
> >> -reordering msd-bidirectional-fe -lm 0:3:$HOME/lm/ru-en.arpa.en:8
> >> -external-bin-dir /opt/moses/mgiza >& training.out &
> >>
> >> After ten days of waiting I have 20-bytes long phraze-table.tgz again!
> >> What I'm doing wrong?
> >>
> >> I have both ru-en and en-ru A3.final.gz files,
> >> aligned-grow-diag-final.and, lex.e2f, lex.f2e of quite good size, but
> >> empty phrase-table, extract.*.sorted.gz and reordering table.
> >>
> >> I'm still having no idea what and why goes wrong:(
> >>
> >> 2015-02-14 21:54 GMT+07:00 Kenneth Heafield <[email protected]>:
> >>
> >>     Sign my petition to add return code checking to train-model.perl.
> >>
> >>     On 02/14/2015 09:33 AM, Tom Hoar wrote:
> >>     > An empty phrase-table.gz file is usually the result of an
> >>     > ill-prepared training corpus. Make sure you run the final
> >>     > corpus through clean-corpus-n.perl.
> >>     >
> >>     > On 02/14/2015 09:19 PM, Александр Паньшин wrote:
> >>     >> Hello, everybody!
> >>     >>
> >>     >> I have a problem with moses. I created big parallel corpus by
> >>     >> concatenating a bunch of existing corpuses on
> >>     >> http://opus.lingfil.uu.se. After that I cleaned up results (while
> >>     >> creating tokens script reported some errors. I deleted error-prone
> >>     >> rows from both of parts).
> >>     >>
> >>     >> Then I started to train translation model using mgiza with such an
> >>     >> executable:
> >>     >>
> >>     >> nohup nice /opt/moses/scripts/training/train-model.perl --parallel
> >>     >> -mgiza -mgiza-cpus 20 -cores 20 -root-dir train -corpus
> >>     >> ~/corpus/ru-en.clean -f ru -e en -alignment grow-diag-final-and
> >>     >> -reordering msd-bidirectional-fe -lm 0:3:$HOME/lm/ru-en.arpa.en:8
> >>     >> -external-bin-dir /opt/moses/mgiza >& training.out &
> >>     >>
> >>     >> After a week of work I have this in the end of training.out:
> >>     >> (7) learn reordering model @ Sun Feb 8 15:30:35 MSK 2015
> >>     >> (7.1) [no factors] learn reordering model @ Sun Feb 8 15:30:35 MSK 2015
> >>     >> (7.2) building tables @ Sun Feb 8 15:30:35 MSK 2015
> >>     >> Executing: /opt/moses/scripts/../bin/lexical-reordering-score
> >>     >> /home/adminadmin/working/train/model/extract.o.sorted.gz 0.5
> >>     >> /home/adminadmin/working/train/model/reordering-table. --model "wbe
> >>     >> msd wbe-msd-bidirectional-fe"
> >>     >> Lexical Reordering Scorer
> >>     >> scores lexical reordering models of several types (hierarchical,
> >>     >> phrase-based and word-based-extraction
> >>     >> (8) learn generation model @ Sun Feb 8 15:30:35 MSK 2015
> >>     >> no generation model requested, skipping step
> >>     >> (9) create moses.ini @ Sun Feb 8 15:30:35 MSK 2015
> >>     >>
> >>     >> There is a bunch of files in ~/working/train folder. Looks like
> >>     >> everything is ok, except the tiny problem: phrase-table.tgz has size
> >>     >> of 20 bytes. And, of course, it's not usable at all!
> >>     >>
> >>     >> Can somebody help and give me a direction where to dig?

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
