Re: [Moses-support] My phrase-table.tgz is 20-bytes long

Tom Hoar Thu, 26 Feb 2015 17:27:57 -0800

The temp space challenge with train-model.perl is not only how much temp space is needed, but also where the temp files are created. train-model.perl seems unpredictable about where temp files are placed.

In general, the /tmp folder goes unused unless you specify it with the "--temp-dir" option. Even when specified, only the sort functions use it. Without the option, the sort temp files go to the "--model-dir" folder, which by default is "--root-dir / model". Finally, if not specified, --root-dir defaults to the current working directory (i.e. cwd or ".").

That said, the temp folder for extraction step 3 is different. Its temp files default to a newly-created "tmp" folder under the parent directory of the path value set with "--extract-file". If "--extract-file" is not set, the default is "--model-dir / extract", which also follows the "--model-dir" path defined above.
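As a rough illustration of the rules above, here is a minimal Python sketch of where the temp files would end up for a given set of options. It only restates the defaults as I described them; it is not code from train-model.perl, and the function name and its keyword arguments are made up for this example.

```python
from pathlib import Path


def resolve_temp_locations(root_dir=None, model_dir=None,
                           temp_dir=None, extract_file=None):
    """Hypothetical helper: mirrors the defaults described in this message."""
    # --root-dir defaults to the current working directory.
    root = Path(root_dir) if root_dir else Path.cwd()
    # --model-dir defaults to "<root-dir>/model".
    model = Path(model_dir) if model_dir else root / "model"
    # Sort temp files use --temp-dir only when it is given, else --model-dir.
    sort_tmp = Path(temp_dir) if temp_dir else model
    # --extract-file defaults to "<model-dir>/extract".
    extract = Path(extract_file) if extract_file else model / "extract"
    # Extraction temp files go to a "tmp" folder next to the extract file.
    extract_tmp = extract.parent / "tmp"
    return {"sort": sort_tmp, "extract": extract_tmp}


if __name__ == "__main__":
    for step, location in resolve_temp_locations(temp_dir="/tmp").items():
        print(f"{step}: {location}")
```

Note that under these rules, setting "--temp-dir" moves the sort temp files but leaves the extraction temp files under the model folder.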

Because of this complexity, we advise our customers to use a hardware configuration that relies on one large root folder partition... i.e. 1 or 2 TB mounted as "/" root without any other mounts. This gives Moses one contiguous pool of temp storage independent of the configuration complexities. I acknowledge this recommendation violates normal system administrators' configurations, but without significant changes to train-model.perl, this is our best recommendation.
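If a single large partition is not an option, one workaround is to check the free space at each folder that will receive temp files before starting a long run. The sketch below is only an illustration: the helper name is made up, the folder list is whatever applies to your setup, and the 500 GiB threshold simply echoes Barry's estimate quoted below for a 25M sentence pair corpus.

```python
import shutil

GIB = 1024 ** 3


def free_space_report(paths, required_bytes):
    """Hypothetical helper: map each path to (free_bytes, is_enough)."""
    report = {}
    for path in paths:
        # disk_usage reports on the filesystem that contains the path.
        free = shutil.disk_usage(path).free
        report[path] = (free, free >= required_bytes)
    return report


if __name__ == "__main__":
    # Assumed folders to check: replace with your own working and temp dirs.
    for path, (free, ok) in free_space_report(["."], 500 * GIB).items():
        status = "ok" if ok else "too small"
        print(f"{path}: {free / GIB:.0f} GiB free, {status}")
```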




On 02/27/2015 04:28 AM, Barry Haddow wrote:

Hi Alexander
From the error logs, it looks as though alignment went fine; the training pipeline reports 24860460 lines of aligned bitext. Since the extract files were empty, I'd suggest that extraction crashed, and the most likely cause is that it ran out of disk. I'm not sure what happened to the error messages.
For 25M sentence pairs, the final phrase table could easily be 30G and the intermediate files are larger. You probably need more like 500G to be safe.
I would follow Tom's advice and start with a much smaller corpus to see how the process works. Also, for the full corpus, you could look into fast_align (https://github.com/clab/fast_align) for alignment as it is much faster than mgiza (e.g. 2 days versus 2 weeks), and use EMS for large jobs since it's much easier to restart a failed step.
cheers - Barry

On 26/02/15 15:06, Александр Паньшин wrote:
Hi Barry!
Here you can download training.out: https://www.dropbox.com/s/d0f0n99x4wbw3mo/training.out.gz?dl=1
I have about 50 GB of free space in the working dir.
2015-02-25 17:19 GMT+07:00 Barry Haddow <[email protected]>:
    Hi Alexander,

    It looks like something went wrong at the extract stage. If you
    could make your training.out available then we can look for clues.

    Could the system have run out of disk space, either in the
    working directory or in /tmp? A lot of space is required to build
    the extract files and phrase tables.

    cheers - Barry
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
