Re: [Moses-support] My phrase-table.tgz is 20-bytes long

Kenneth Heafield Mon, 02 Mar 2015 06:28:01 -0800

BTW there should really be a libpigz.


On 03/02/2015 08:51 AM, Marcin Junczys-Dowmunt wrote:
> When combined with pigz (parallel gzip), the piped approach is probably
> faster. Just saying.
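
A minimal sketch of the piped approach mentioned above, assuming Python as the consumer: an external `pigz -dc` process decompresses the file and the reader consumes its stdout. The function name `stream_lines` is hypothetical, and the fallback to the standard `gzip` module when `pigz` is not installed is an assumption added for portability. On the read side the gain probably comes mainly from moving decompression into a separate process.

```python
import gzip
import shutil
import subprocess
from typing import Iterator


def stream_lines(path: str) -> Iterator[str]:
    """Yield decoded lines from a gzip file, using pigz when it is available."""
    pigz = shutil.which("pigz")
    if pigz is None:
        # Fallback: decompress in-process with the standard library.
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            yield from handle
        return
    # Decompression runs in a separate process; we read from its stdout pipe.
    proc = subprocess.Popen([pigz, "-dc", path], stdout=subprocess.PIPE)
    try:
        for raw in proc.stdout:
            yield raw.decode("utf-8")
    finally:
        proc.stdout.close()
        proc.wait()
```

Either branch yields the same lines, so a caller does not need to know which decompressor was used.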
> 
> On 02.03.2015 at 14:48, Hieu Hoang wrote:
>> thanks ventsis,
>>
>> i think most of the improvements in your branch are now in the main
>> Moses branch, but done in a different way. For instance, your branch
>> reduces disk IO by unzipping files through named pipes. In Moses now,
>> the executables read the zipped files directly, rather than
>> unzipping them first.
>>
>> There's also support for parallelism in most stages of training.
>>
>> If deadyaga is running out of space during extract, it's likely he'll
>> run out of space again at a later point, even with all the space
>> optimisation turned on. The surest way is to run it on a bigger disk.
>>
>> Hieu Hoang
>> Research Associate (until March 2015)
>> ** searching for interesting commercial MT position **
>> University of Edinburgh
>> http://www.hoang.co.uk/hieu
>>
>>
>> On 2 March 2015 at 07:45, "Венцислав Жечев (Ventsislav Zhechev)"
>> <[email protected]> wrote:
>>
>>     Hi all,
>>     I’m chiming in a bit late on this conversation, but thought what I
>>     have could help.
>>
>>     At Autodesk we modified the training process a few years back to
>>     avoid the necessity of using so much temp disk space and to reduce
>>     the overall disk I/O during training—all with the premise that in
>>     a stable environment we don’t care about the intermediary files.
>>
>>     Our code is available on GitHub
>>     at https://github.com/venyz/ADSKMosesTraining
>>     Have a look specifically
>>     at 
>> https://github.com/venyz/ADSKMosesTraining/blob/master/scripts/training/train-model.perl
>>     around lines 500–600, where we launch a bunch of processing steps
>>     in parallel, setting up the data transfer between tools via Unix
>>     named pipes. For this to work properly, we also had to modify some
>>     of the other training workflow tools, but never had the time and
>>     resources to reintegrate our changes in the Moses master (at the
>>     time Moses wasn’t even on Git yet).
>>     Though I doubt you can use this directly with the current moses
>>     training tools, I hope this gives you inspiration and helps solve
>>     Alexander’s disk space issue.
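
The named-pipe idea described above can be sketched as follows. This is not the ADSKMosesTraining code, only an illustration under assumptions: the helper names `write_gzip_to_fifo` and `count_lines_via_fifo` are hypothetical, a thread stands in for the producing tool, and `os.mkfifo` restricts it to Unix-like systems. The point is that the uncompressed data only ever passes through the pipe and is never stored on disk.

```python
import gzip
import os
import tempfile
import threading


def write_gzip_to_fifo(gz_path: str, fifo_path: str) -> None:
    """Producer: decompress gz_path and stream the bytes into the named pipe."""
    with gzip.open(gz_path, "rb") as src, open(fifo_path, "wb") as dst:
        for chunk in iter(lambda: src.read(1 << 16), b""):
            dst.write(chunk)


def count_lines_via_fifo(gz_path: str) -> int:
    """Consumer: read the decompressed stream from a FIFO and count its lines."""
    workdir = tempfile.mkdtemp()
    fifo_path = os.path.join(workdir, "extract.fifo")
    os.mkfifo(fifo_path)  # Unix only
    writer = threading.Thread(target=write_gzip_to_fifo, args=(gz_path, fifo_path))
    writer.start()
    try:
        # Opening for reading blocks until the writer has opened its end.
        with open(fifo_path, "rb") as pipe:
            return sum(1 for _ in pipe)
    finally:
        writer.join()
        os.remove(fifo_path)
        os.rmdir(workdir)
```

In a real pipeline the producer and consumer would be separate processes, each given the FIFO path in place of a regular file name.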
>>
>>     Also, Tom, you said that you’re working on updating
>>     train-model.perl anyway, so our modifications could probably be
>>     something you can consider using.
>>
>>
>>     For more information on what we actually did and a numerical
>>     comparison of the disk usage statistics with and without our
>>     modifications, check Section 2.1 in
>>     Zhechev, Ventsislav. 2012. Machine Translation Infrastructure and
>>     Post-editing Performance at Autodesk. In Proceedings of the AMTA
>>     2012 Workshop on Post-editing Technology and Practice (WPTP ’12),
>>     eds. Sharon O’Brien, Michel Simard and Lucia Specia. San Diego, CA. 
>>     (http://amta2012.amtaweb.org/AMTA2012Files/html/5/5_paper.pdf)
>>
>>
>>     Cheers,
>>
>>     Ventzi
>>
>>     –––––––
>>     Dr. Ventsislav Zhechev
>>     Computational Linguist, Certified ScrumMaster®
>>     Platform Architecture and Technologies
>>     Localisation Services
>>
>>     MAIN +41 32 723 91 22
>>     FAX +41 32 723 93 99
>>
>>     http://VentsislavZhechev.eu
>>
>>     Autodesk, Inc.
>>     Rue de Puits-Godet 6
>>     2000 Neuchâtel, Switzerland
>>     http://www.autodesk.com
>>
>>
>>
>>
>>>     On 27.02.2015 at 12:49, [email protected] wrote:
>>>
>>>     Message: 3
>>>     Date: Fri, 27 Feb 2015 08:25:01 +0700
>>>     From: Tom Hoar <[email protected]>
>>>     Subject: Re: [Moses-support] My phrase-table.tgz is 20-bytes long
>>>     To: [email protected]
>>>     Message-ID: <[email protected]>
>>>
>>>
>>>     The temp space challenge with train-model.perl is not only how
>>>     much temp space is needed, but also where the temp files are
>>>     created. train-model.perl seems unpredictable about where temp
>>>     files are placed.
>>>
>>>     In general, the /tmp folder goes unused unless you specify it
>>>     with the "--temp-dir" option. Even when it is specified, only the
>>>     sort functions use it. Without the option, the sort temp files go
>>>     to the "--model-dir" folder, which by default is
>>>     "--root-dir / model". Finally, if not specified, --root-dir
>>>     defaults to the current working directory (i.e. cwd or ".").
>>>
>>>     That said, the temp folder for extraction step 3 is different.
>>>     Its temp files default to a newly-created "tmp" folder under the
>>>     parent directory of the path value set with "--extract-file". If
>>>     "--extract-file" is not set, the default is
>>>     "--model-dir / extract", which also follows the "--model-dir"
>>>     path defined above.
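
The defaulting rules described in this message can be written down as a small sketch. It models only what the message says, and has not been checked against train-model.perl itself; the function name `resolve_dirs` and the dictionary keys are hypothetical.

```python
import os
from typing import Dict, Optional


def resolve_dirs(
    root_dir: Optional[str] = None,
    model_dir: Optional[str] = None,
    temp_dir: Optional[str] = None,
    extract_file: Optional[str] = None,
) -> Dict[str, str]:
    """Resolve working locations following the rules stated in the message."""
    # --root-dir defaults to the current working directory.
    root = root_dir or os.getcwd()
    # --model-dir defaults to <root-dir>/model.
    model = model_dir or os.path.join(root, "model")
    # Sort temp files use --temp-dir when given, otherwise the model dir.
    sort_tmp = temp_dir or model
    # --extract-file defaults to <model-dir>/extract.
    extract = extract_file or os.path.join(model, "extract")
    # Extraction temp files go to "tmp" under the parent of --extract-file,
    # regardless of --temp-dir.
    extract_tmp = os.path.join(os.path.dirname(extract), "tmp")
    return {
        "root_dir": root,
        "model_dir": model,
        "sort_tmp": sort_tmp,
        "extract_file": extract,
        "extract_tmp": extract_tmp,
    }
```

Under these rules, passing only `temp_dir="/tmp"` moves the sort temp files but leaves the extraction temp files under the model directory, which is the situation the message warns about.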
>>>
>>>     Because of this complexity, we advise our customers to use a
>>>     hardware configuration that relies on one large root folder
>>>     partition, i.e. 1 or 2 TB mounted as "/" root without any other
>>>     mounts. This gives Moses one contiguous pool of temp storage
>>>     independent of the configuration complexities. I acknowledge this
>>>     recommendation violates normal system administrators'
>>>     configurations, but without significant changes to
>>>     train-model.perl, this is our best recommendation.
>>>
>>>
>>>
>>>     On 02/27/2015 04:28 AM, Barry Haddow wrote:
>>>>     Hi Alexander
>>>>
>>>>     From the error logs, it looks as though alignment went fine: the
>>>>     training pipeline reports 24860460 lines of aligned bitext. Since
>>>>     the extract files were empty, I'd suggest that extraction
>>>>     crashed, and the most likely cause is that it ran out of disk
>>>>     space. I'm not sure what happened to the error messages.
>>>>
>>>>     For 25M sentence pairs, the final phrase table could easily be
>>>>     30G and the intermediate files are larger. You probably need more
>>>>     like 500G to be safe.
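
A quick pre-flight check along these lines might catch the problem before training starts. This is a sketch only: the helper names are hypothetical, the 500 figure is the rough estimate given above for this corpus, and "G" is treated here as GiB for simplicity.

```python
import shutil

GIB = 1024 ** 3


def has_enough_space(path: str, required_gib: float) -> bool:
    """Return True if the filesystem holding `path` has enough free space."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gib * GIB


def shortfall_gib(free_gib: float, required_gib: float) -> float:
    """How much space is missing (0.0 if there is enough)."""
    return max(0.0, required_gib - free_gib)


if __name__ == "__main__":
    # Check the current working directory against the rough 500G estimate.
    if not has_enough_space(".", 500):
        print("Warning: less than 500 GiB free in the working directory")
```

With the roughly 50G reported free in this thread, `shortfall_gib(50, 500)` gives 450, which is consistent with extraction running out of disk.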
>>>>
>>>>     I would follow Tom's advice and start with a much smaller corpus
>>>>     to see how the process works. Also, for the full corpus, you
>>>>     could look into fast_align (https://github.com/clab/fast_align)
>>>>     for alignment as it is much faster than mgiza (e.g. 2 days versus
>>>>     2 weeks), and use EMS for large jobs since it's much easier to
>>>>     restart a failed step.
>>>>
>>>>     cheers - Barry
>>>>
>>>>     On 26/02/15 15:06, ????????? ??????? wrote:
>>>>>
>>>>>     Hi Barry!
>>>>>
>>>>>     Here you can download training.out 
>>>>>     https://www.dropbox.com/s/d0f0n99x4wbw3mo/training.out.gz?dl=1
>>>>>
>>>>>     I have about 50 GB of free space in the working dir.
>>>>>
>>>>>
>>>>>     2015-02-25 17:19 GMT+07:00 Barry Haddow
>>>>>     <[email protected]>:
>>>>>
>>>>>        Hi Alexander,
>>>>>
>>>>>        It looks like something went wrong at the extract stage. If you
>>>>>        could make your training.out available then we can look for
>>>>>        clues.
>>>>>
>>>>>        Could the system have run out of disk space, either in the
>>>>>        working directory or in /tmp? A lot of space is required to
>>>>>        build the extract files and phrase tables.
>>>>>
>>>>>        cheers - Barry
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>     _______________________________________________
>>>>     Moses-support mailing list
>>>>     [email protected]
>>>>     http://mailman.mit.edu/mailman/listinfo/moses-support
>>>
