Re: [Moses-support] My phrase-table.tgz is 20-bytes long

Hieu Hoang Mon, 02 Mar 2015 08:39:57 -0800

I don't mind. If you can do it on a branch and keep it in sync with master, I
can test it once you're done.
On 2 Mar 2015 15:33, "Tom Hoar" <[email protected]>
wrote:


>  Yes, this is how the "lexical-reordering-score" binary works. We will
> likely update that with our Windows updates (which also apply to Linux).
>
> Also, the consolidate binary's command-line arguments only support a file
> system file as output. The train-model.perl script uses /dev/stdout, which
> works but is not cross-platform friendly. Tests show that many other
> binaries hard-code /dev/stdout. We will update those to use standard file
> descriptors. For binaries with command-line options, does anyone object to
> us updating them to support the standard GNU "-" (dash) argument to
> represent file descriptors for stdin/stdout/stderr?
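A minimal sketch of the "-" convention proposed above, written in Python rather than in the language of the Moses binaries. The names `open_output` and `write_lines` are hypothetical and are not part of Moses; the only point is that "-" selects standard output and anything else is treated as a file path.

```python
import sys
from contextlib import contextmanager


@contextmanager
def open_output(path):
    """Yield a writable text stream; "-" means standard output."""
    if path == "-":
        # The caller does not own sys.stdout, so it is not closed here.
        yield sys.stdout
    else:
        with open(path, "w", encoding="utf-8") as handle:
            yield handle


def write_lines(path, lines):
    """Write each line to the file at `path`, or to stdout if path is "-"."""
    with open_output(path) as out:
        for line in lines:
            out.write(line + "\n")
```

With this shape, a calling script can pass "-" instead of /dev/stdout, which avoids depending on a device path that does not exist on every platform.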
>
>
> On 03/02/2015 09:55 PM, Marcin Junczys-Dowmunt wrote:
>
> There are a few tools that write out gzipped files and they take ages. I
> think the reordering model is first written to an intermediate text file
> (single process) and then compressed from within the binary again with a
> single process instead of using a pipe to gzip/pigz. I believe there are a
> couple more examples.
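To illustrate the piped approach described above, here is a rough Python sketch that streams lines straight into an external compressor, with no intermediate text file. The function name `write_compressed` is hypothetical, and the sketch assumes that `pigz` or `gzip` is on the PATH; it is not how the Moses binaries are implemented.

```python
import shutil
import subprocess


def write_compressed(lines, out_path):
    """Stream lines into an external compressor instead of a temp text file."""
    # Prefer pigz (parallel gzip) when installed; fall back to plain gzip.
    compressor = shutil.which("pigz") or shutil.which("gzip")
    if compressor is None:
        raise RuntimeError("neither pigz nor gzip found on PATH")
    with open(out_path, "wb") as out:
        # "-c" makes the compressor write to its stdout, which is our file.
        proc = subprocess.Popen(
            [compressor, "-c"], stdin=subprocess.PIPE, stdout=out
        )
        for line in lines:
            proc.stdin.write((line + "\n").encode("utf-8"))
        proc.stdin.close()
        if proc.wait() != 0:
            raise RuntimeError(f"{compressor} exited with status {proc.returncode}")
```

The producer and the compressor run as separate processes, so compression overlaps with writing, and pigz can spread the work over several cores.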
>
> On 02.03.2015 at 15:39, Hieu Hoang wrote:
>
> Probably not by much. The CPU is saturated anyway when you use parallelism.
>
> pigz is used during sorting. You can even use it to compress the sort's
> intermediate files by putting this in the train-model.perl options:
>    -sort-compress pigz
>
>
> On 02/03/2015 13:51, Marcin Junczys-Dowmunt wrote:
>
> When combined with pigz (parallel gzip), the piped approach is probably
> faster. Just saying.
>
> On 02.03.2015 at 14:48, Hieu Hoang wrote:
>
> thanks ventsis,
>
>  I think most of the improvements in your branch are now in the main
> Moses branch, but done in a different way. For instance, yours reduces disk
> I/O by using named pipes to unzip files. In Moses now, the executables read
> the zipped files directly, rather than unzipping them first.
>
>  There's also support for parallelism in most stages of training
>
>  If deadyaga is running out of space during extract, it's likely he'll run
> out of space again at a later point, even with all the space optimisations
> turned on. The surest way is to run it on a bigger disk.
>
>   Hieu Hoang
> Research Associate (until March 2015)
>  ** searching for interesting commercial MT position **
>  University of Edinburgh
> http://www.hoang.co.uk/hieu
>
>
> On 2 March 2015 at 07:45, "Венцислав Жечев (Ventsislav Zhechev)" <
> [email protected]> wrote:
>
>> Hi all,
>> I’m chiming in a bit late on this conversation, but thought what I have
>> could help.
>>
>>  At Autodesk we modified the training process a few years back to avoid
>> the necessity of using so much temp disk space and to reduce the overall
>> disk I/O during training—all with the premise that in a stable environment
>> we don’t care about the intermediary files.
>>
>>  Our code is available on GitHub at
>> https://github.com/venyz/ADSKMosesTraining
>> Have a look specifically at
>> https://github.com/venyz/ADSKMosesTraining/blob/master/scripts/training/train-model.perl
>> around lines 500–600, where we launch a bunch of processing steps in
>> parallel setting up the data transfer between tools via unix named pipes.
>> For this to work properly, we also had to modify some of the other training
>> workflow tools, but never had the time and resources to reintegrate our
>> changes in the Moses master (at the time Moses wasn’t even on Git yet).
>> Though I doubt you can use this directly with the current moses training
>> tools, I hope this gives you inspiration and helps solve Alexander’s disk
>> space issue.
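As a rough illustration of passing data between stages through a Unix named pipe, here is a small Python sketch. It is not taken from the ADSKMosesTraining code; `stream_through_fifo` and the file name `stage.fifo` are hypothetical, and the sketch assumes a Unix-like system where `os.mkfifo` is available.

```python
import os
import tempfile
import threading


def stream_through_fifo(lines):
    """Send lines from a producer to a consumer through a named pipe."""
    with tempfile.TemporaryDirectory() as tmp:
        fifo = os.path.join(tmp, "stage.fifo")
        os.mkfifo(fifo)  # creates a pipe node; no data is stored on disk

        def producer():
            # Opening for write blocks until the consumer opens for read.
            with open(fifo, "w", encoding="utf-8") as out:
                for line in lines:
                    out.write(line + "\n")

        worker = threading.Thread(target=producer)
        worker.start()
        # The consumer reads until the producer closes its end of the pipe.
        with open(fifo, encoding="utf-8") as inp:
            received = [line.rstrip("\n") for line in inp]
        worker.join()
        return received
```

In a real training pipeline the producer and consumer would be separate tools launched in parallel, with the pipe path passed to each in place of an intermediate file.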
>>
>>  Also, Tom, you said that you’re working on updating train-model.perl
>> anyway, so probably our modifications could be something you can consider
>> using.
>>
>>
>>  For more information on what we actually did and a numerical comparison
>> of the disk usage statistics with and without our modifications, check
>> Section 2.1 in
>> Zhechev, Ventsislav. 2012. Machine Translation Infrastructure and
>> Post-editing Performance at Autodesk. In Proceedings of the AMTA
>> 2012 Workshop on Post-editing Technology and Practice (WPTP ’12), eds.
>> Sharon O’Brien, Michel Simard and Lucia Specia. San Diego, CA.
>> (http://amta2012.amtaweb.org/AMTA2012Files/html/5/5_paper.pdf)
>>
>>
>>  Cheers,
>>
>> Ventzi
>>
>> –––––––
>> Dr. Ventsislav Zhechev
>> Computational Linguist, Certified ScrumMaster®
>> Platform Architecture and Technologies
>> Localisation Services
>>
>> MAIN +41 32 723 91 22
>> FAX +41 32 723 93 99
>>
>> http://VentsislavZhechev.eu
>>
>> Autodesk, Inc.
>> Rue de Puits-Godet 6
>> 2000 Neuchâtel, Switzerland
>> www.autodesk.com
>>
>>
>>
>>
>> On 27.02.2015 at 12:49, [email protected] wrote:
>>
>> Message: 3
>> Date: Fri, 27 Feb 2015 08:25:01 +0700
>> From: Tom Hoar <[email protected]>
>> Subject: Re: [Moses-support] My phrase-table.tgz is 20-bytes long
>> To: [email protected]
>> Message-ID: <[email protected]>
>> Content-Type: text/plain; charset="utf-8"
>>
>>
>> The temp space challenge with train-model.perl is not only how much temp
>> space is needed, but also where are the temp files created?
>> train-model.perl seems unpredictable about where temp files are placed.
>>
>> In general, the /tmp folder goes unused unless you specify it with the
>> "--temp-dir" option. Even when specified, only the sort functions use
>> it. Without that option, the sort temp files go to
>> the "--model-dir" folder, which by default is "--root-dir / model".
>> Finally, if not specified, --root-dir defaults to the current working
>> directory (i.e. cwd or ".").
>>
>> That said, the temp folder for extraction step 3 is different. Its temp
>> files default to a newly-created "tmp" folder under the parent directory
>> of the path value set with "--extract-file". If "--extract-file" is not
>> set, the default is "--model-dir / extract", which also follows the
>> "--model-dir" path defined above.
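The default chain described above can be written out as a small Python sketch. It follows only the description in this message and has not been checked against the train-model.perl source; `resolve_dirs` and its parameter names are hypothetical stand-ins for the "--root-dir", "--model-dir", "--temp-dir" and "--extract-file" options.

```python
import os


def resolve_dirs(root_dir=None, model_dir=None, temp_dir=None, extract_file=None):
    """Return where sort and extract temp files would land, per the text above."""
    # --root-dir defaults to the current working directory.
    root = root_dir if root_dir is not None else os.getcwd()
    # --model-dir defaults to <root-dir>/model.
    model = model_dir if model_dir is not None else os.path.join(root, "model")
    # Only the sort step honours --temp-dir; otherwise it uses --model-dir.
    sort_tmp = temp_dir if temp_dir is not None else model
    # --extract-file defaults to <model-dir>/extract.
    extract = extract_file if extract_file is not None else os.path.join(model, "extract")
    # Extraction uses a "tmp" folder next to the extract file, ignoring --temp-dir.
    extract_tmp = os.path.join(os.path.dirname(extract), "tmp")
    return {"sort_tmp": sort_tmp, "extract_tmp": extract_tmp}
```

Under these assumptions, setting only the temp dir to /tmp moves the sort files there but leaves the extraction temp files under the model folder, which matches the unpredictability described above.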
>>
>> Because of this complexity, we advise our customers to use a hardware
>> configuration that relies on one large root folder partition... i.e. 1
>> or 2 TB mounted as "/" root without any other mounts. This gives Moses
>> one contiguous pool of temp storage independent of the configuration
>> complexities. I acknowledge this recommendation violates normal system
>> administrators' configurations, but without significant changes to
>> train-model.perl, this is our best recommendation.
>>
>>
>>
>> On 02/27/2015 04:28 AM, Barry Haddow wrote:
>>
>>  Hi Alexander
>>
>> From the error logs, it looks as though alignment went fine: the
>> training pipeline reports 24860460 lines of aligned bitext. Since the
>> extract files were empty, I'd suggest that extraction crashed, and the
>> most likely cause is that it ran out of disk space. I'm not sure what
>> happened to the error messages.
>>
>> For 25M sentence pairs, the final phrase table could easily be 30G and
>> the intermediate files are larger. You probably need more like 500G to
>> be safe.
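A quick pre-flight check along these lines can catch the problem before a long run starts. This is a minimal Python sketch using the standard library; `has_free_space` is a hypothetical helper, and the 500 figure in the usage note is simply the estimate given above, not a measured requirement.

```python
import shutil


def has_free_space(path, required_gb):
    """Return True if the filesystem holding `path` has at least required_gb free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024**3
```

For example, calling `has_free_space(".", 500)` in the working directory before training would report whether the suggested 500G is available there. Each directory that receives temp files would need its own check if it sits on a different partition.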
>>
>> I would follow Tom's advice and start with a much smaller corpus to
>> see how the process works. Also, for the full corpus, you could look
>> into fast_align (https://github.com/clab/fast_align) for alignment as
>> it is much faster than mgiza (e.g. 2 days versus 2 weeks), and use EMS
>> for large jobs since it's much easier to restart a failed step.
>>
>> cheers - Barry
>>
>>  On 26/02/15 15:06, ????????? ??????? wrote:
>>
>>
>> Hi Barry!
>>
>> Here you can download training.out
>> https://www.dropbox.com/s/d0f0n99x4wbw3mo/training.out.gz?dl=1
>>
>> I have about 50 GB of free space in the working dir.
>>
>>
>> 2015-02-25 17:19 GMT+07:00 Barry Haddow <[email protected]>:
>>
>>    Hi Alexander,
>>
>>    It looks like something went wrong at the extract stage. If you
>>    could make your training.out available then we can look for clues.
>>
>>    Could the system have run out of disk space, either in the
>>    working directory or in /tmp? A lot of space is required to build
>>    the extract files and phrase tables.
>>
>>    cheers - Barry
>>
>>
>>
>>
>>
>> _______________________________________________
>> Moses-support mailing list
>> [email protected]
>> http://mailman.mit.edu/mailman/listinfo/moses-support

