Thanks Ventsis,
I think most of the improvements in your branch are now in the main
Moses branch, but done in a different way. For instance, your branch
reduces disk I/O by using named pipes to unzip files. In Moses now,
the zipped files are read directly by the executables, rather than
being unzipped first.
There's also support for parallelism in most stages of training.
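To illustrate the idea of reading zipped files directly, here is a minimal Python sketch. It is not Moses code: the file name and the `|||`-separated sample line are made up for the example. It only shows that a gzip file can be consumed as a stream, so no unzipped copy ever lands on disk.

```python
import gzip
import os
import tempfile


def count_lines_streaming(gz_path):
    """Count lines in a gzip file without writing an unzipped copy to disk."""
    count = 0
    # gzip.open decompresses on the fly as the file is iterated.
    with gzip.open(gz_path, "rt", encoding="utf-8") as handle:
        for _ in handle:
            count += 1
    return count


def demo():
    # Build a small gzip file standing in for a compressed extract file
    # (hypothetical name and content, for illustration only).
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "extract.sample.gz")
        with gzip.open(path, "wt", encoding="utf-8") as out:
            out.write("a ||| b ||| 0-0\n" * 3)
        return count_lines_streaming(path)
```

Peak disk usage stays at the size of the compressed file, which is the saving described above.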
If deadyaga is running out of space during extract, it's likely he'll
run out of space again at a later point, even with all the space
optimisations turned on. The surest fix is to run it on a bigger disk.
Hieu Hoang
Research Associate (until March 2015)
** searching for interesting commercial MT position **
University of Edinburgh
http://www.hoang.co.uk/hieu
On 2 March 2015 at 07:45, "Венцислав Жечев (Ventsislav Zhechev)"
<[email protected]>
wrote:
Hi all,
I’m chiming in a bit late on this conversation, but thought what
I have could help.
At Autodesk we modified the training process a few years back to
avoid the necessity of using so much temp disk space and to
reduce the overall disk I/O during training—all with the premise
that in a stable environment we don’t care about the intermediary
files.
Our code is available on GitHub at
https://github.com/venyz/ADSKMosesTraining
Have a look specifically at
https://github.com/venyz/ADSKMosesTraining/blob/master/scripts/training/train-model.perl
around lines 500–600, where we launch a bunch of processing steps
in parallel setting up the data transfer between tools via unix
named pipes. For this to work properly, we also had to modify
some of the other training workflow tools, but never had the time
and resources to reintegrate our changes in the Moses master (at
the time Moses wasn’t even on Git yet).
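The named-pipe approach described above can be sketched in a few lines of Python. This is an illustration only, not the code from ADSKMosesTraining: the stage names, the FIFO file name and the upper-casing "processing" step are all invented for the example. It shows one stage handing data to the next through a Unix FIFO, so the intermediate data is never stored as a regular file.

```python
import os
import tempfile
import threading


def produce(fifo_path, lines):
    # Opening a FIFO for writing blocks until a reader opens the other end.
    with open(fifo_path, "w", encoding="utf-8") as fifo:
        for line in lines:
            fifo.write(line + "\n")


def consume(fifo_path):
    # Reads until the writer closes its end of the pipe.
    with open(fifo_path, "r", encoding="utf-8") as fifo:
        return [line.rstrip("\n").upper() for line in fifo]


def run_pipeline(lines):
    with tempfile.TemporaryDirectory() as workdir:
        fifo_path = os.path.join(workdir, "stage.fifo")
        os.mkfifo(fifo_path)  # Unix only; the FIFO holds no data on disk
        writer = threading.Thread(target=produce, args=(fifo_path, lines))
        writer.start()
        result = consume(fifo_path)
        writer.join()
        return result
```

Both ends have to run at the same time, which is why the producer is started in a separate thread here. In a real training workflow the two ends would be separate processes launched in parallel.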
Though I doubt you can use this directly with the current moses
training tools, I hope this gives you inspiration and helps solve
Alexander’s disk space issue.
Also, Tom you said that you’re working on updating
train-model.perl anyway, so probably our modifications could be
something you can consider using.
For more information on what we actually did and a numerical
comparison of the disk usage statistics with and without our
modifications, check Section 2.1 in
Zhechev, Ventsislav. 2012. Machine Translation Infrastructure and
Post-editing Performance at Autodesk. In Proceedings of the AMTA
2012 Workshop on Post-editing Technology and Practice (WPTP ’12),
eds. Sharon O’Brien, Michel Simard and Lucia Specia. San Diego, CA.
(http://amta2012.amtaweb.org/AMTA2012Files/html/5/5_paper.pdf)
Cheers,
Ventzi
–––––––
Dr. Ventsislav Zhechev
Computational Linguist, Certified ScrumMaster®
Platform Architecture and Technologies
Localisation Services
MAIN +41 32 723 91 22
FAX +41 32 723 93 99
http://VentsislavZhechev.eu
Autodesk, Inc.
Rue de Puits-Godet 6
2000 Neuchâtel, Switzerland
www.autodesk.com
On 27.02.2015, at 12:49, [email protected] wrote:
Message: 3
Date: Fri, 27 Feb 2015 08:25:01 +0700
From: Tom Hoar <[email protected]>
Subject: Re: [Moses-support] My phrase-table.tgz is 20-bytes long
To: [email protected]
Message-ID: <[email protected]>
Content-Type: text/plain; charset="utf-8"
The temp space challenge with train-model.perl is not only how much temp
space is needed, but also where the temp files are created.
train-model.perl seems unpredictable about where temp files are placed.
In general, the /tmp folder goes unused unless you specify it with the
"--temp-dir" option. Even when it is specified, only the sort functions
use it. Without the option, the sort temp files go to the "--model-dir"
folder, which by default is "--root-dir / model". Finally, if not
specified, --root-dir defaults to the current working directory
(i.e. cwd or ".").
That said, the temp folder for extraction step 3 is different. Its temp
files default to a newly-created "tmp" folder under the parent directory
of the path value set with "--extract-file". If "--extract-file" is not
set, the default is "--model-dir / extract", which also follows the
"--model-dir" path defined above.
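The defaulting rules described above can be written down as a small Python function. This is only a model of the behaviour as described in this message, not a port of train-model.perl. The function name and the returned keys are hypothetical; the keyword arguments are named after the command-line options mentioned above.

```python
from pathlib import Path


def resolve_temp_locations(cwd, root_dir=None, model_dir=None,
                           temp_dir=None, extract_file=None):
    """Return where sort and extraction temp files would end up."""
    # --root-dir falls back to the current working directory.
    root = Path(root_dir) if root_dir else Path(cwd)
    # --model-dir falls back to <root-dir>/model.
    model = Path(model_dir) if model_dir else root / "model"
    # Sort temp files use --temp-dir if given, else the model dir.
    sort_tmp = Path(temp_dir) if temp_dir else model
    # --extract-file falls back to <model-dir>/extract.
    extract = Path(extract_file) if extract_file else model / "extract"
    # Extraction temp files go to "tmp" beside the extract file.
    extract_tmp = extract.parent / "tmp"
    return {"sort_tmp": sort_tmp, "extract_tmp": extract_tmp}
```

With no options set and a working directory of /work, this gives /work/model for the sort temp files and /work/model/tmp for the extraction temp files. Under this model, setting only "--temp-dir" moves the sort temp files and leaves the extraction temp files where they were.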
Because of this complexity, we advise our customers to use a hardware
configuration that relies on one large root folder partition, i.e. 1 or
2 TB mounted as "/" root without any other mounts. This gives Moses one
contiguous pool of temp storage independent of the configuration
complexities. I acknowledge this recommendation violates normal system
administrators' configurations, but without significant changes to
train-model.perl, this is our best recommendation.
On 02/27/2015 04:28 AM, Barry Haddow wrote:
Hi Alexander
From the error logs, it looks as though alignment went fine: the
training pipeline reports 24860460 lines of aligned bitext. Since the
extract files were empty, I'd suggest that extraction crashed, and the
most likely cause is that it ran out of disk space. I'm not sure what
happened to the error messages.
For 25M sentence pairs, the final phrase table could easily be 30G, and
the intermediate files are larger. You probably need more like 500G to
be safe.
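A quick pre-flight check along these lines can catch a too-small disk before a long training run starts. The sketch below is not part of Moses; the helper names are made up, and the 500 in the usage note is just the rough figure mentioned above, not a computed requirement.

```python
import shutil

GIB = 1024 ** 3  # bytes per gibibyte


def free_gib(path):
    """Free space, in GiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / GIB


def has_headroom(path, required_gib):
    """True if the filesystem holding `path` has at least `required_gib` free."""
    return free_gib(path) >= required_gib
```

For example, has_headroom(".", 500) checks the working directory. The same check needs to be run for every directory where temp files may land.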
I would follow Tom's advice and start with a much smaller corpus to see
how the process works. Also, for the full corpus, you could look into
fast_align (https://github.com/clab/fast_align) for alignment, as it is
much faster than mgiza (e.g. 2 days versus 2 weeks), and use EMS for
large jobs since it's much easier to restart a failed step.
cheers - Barry
On 26/02/15 15:06, ????????? ??????? wrote:
Hi Barry!
Here you can download training.out
https://www.dropbox.com/s/d0f0n99x4wbw3mo/training.out.gz?dl=1
I have about 50 GB of free space in the working dir.
2015-02-25 17:19 GMT+07:00 Barry Haddow <[email protected]>:
Hi Alexander,
It looks like something went wrong at the extract stage. If you
could make your training.out available then we can look for clues.
Could the system have run out of disk space, either in the
working directory or in /tmp? A lot of space is required to build
the extract files and phrase tables.
cheers - Barry
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support