BTW there should really be a libpigz.
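
Until something like that exists, the piped approach discussed below can be approximated from a script. This is only a rough sketch in Python, not what Moses or the ADSK branch actually does: the helper name `open_gz_lines` is made up for illustration, and it assumes `pigz` may or may not be on the PATH, falling back to the standard `gzip` module when it is not.

```python
import gzip
import shutil
import subprocess


def open_gz_lines(path):
    """Yield text lines from a gzip file (hypothetical helper).

    If a pigz binary is found, decompression runs in a child process and
    the lines are read from its stdout pipe; otherwise the file is
    decompressed in-process with the standard library.
    """
    pigz = shutil.which("pigz")
    if pigz is None:
        # Fallback: in-process decompression.
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            yield from handle
        return

    # "pigz -dc" decompresses to stdout, so nothing unzipped touches the disk.
    proc = subprocess.Popen([pigz, "-dc", str(path)], stdout=subprocess.PIPE)
    try:
        for raw in proc.stdout:
            yield raw.decode("utf-8")
    finally:
        # Close the pipe and reap the child even if the caller stops early.
        proc.stdout.close()
        proc.wait()
```

The caller just iterates over `open_gz_lines("some-file.gz")` (file name is a placeholder) and never sees an unzipped copy on disk.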

On 03/02/2015 08:51 AM, Marcin Junczys-Dowmunt wrote:
> When combined with pigz (parallel gzip), the piped approach is probably
> faster. Just saying.
>
> On 02.03.2015 at 14:48, Hieu Hoang wrote:
>> thanks ventsis,
>>
>> i think most of the improvements in your branch are now in the main
>> Moses branch, but done in a different way. For instance, disk IO is
>> reduced in yours by using named pipes that are used to unzip files. In
>> moses now, the zipped files are read directly by the executables,
>> rather than unzipping 1st.
>>
>> There's also support for parallelism in most stages of training
>>
>> If deadyaga is running out of space during extract, it's likely he'll
>> run out of space again at a later point, even with all the space
>> optimisation turned on. The surest way is to run it on a bigger disk
>>
>> Hieu Hoang
>> Research Associate (until March 2015)
>> ** searching for interesting commercial MT position **
>> University of Edinburgh
>> http://www.hoang.co.uk/hieu
>>
>>
>> On 2 March 2015 at 07:45, "Венцислав Жечев (Ventsislav Zhechev)"
>> <[email protected]> wrote:
>>
>>     Hi all,
>>     I’m chiming in a bit late on this conversation, but thought what I
>>     have could help.
>>
>>     At Autodesk we modified the training process a few years back to
>>     avoid the necessity of using so much temp disk space and to reduce
>>     the overall disk I/O during training, all with the premise that in
>>     a stable environment we don’t care about the intermediary files.
>>
>>     Our code is available on GitHub at
>>     https://github.com/venyz/ADSKMosesTraining
>>     Have a look specifically at
>>     https://github.com/venyz/ADSKMosesTraining/blob/master/scripts/training/train-model.perl
>>     around lines 500–600, where we launch a bunch of processing steps
>>     in parallel, setting up the data transfer between tools via unix
>>     named pipes. For this to work properly, we also had to modify some
>>     of the other training workflow tools, but never had the time and
>>     resources to reintegrate our changes in the Moses master (at the
>>     time Moses wasn’t even on Git yet).
>>     Though I doubt you can use this directly with the current moses
>>     training tools, I hope this gives you inspiration and helps solve
>>     Alexander’s disk space issue.
>>
>>     Also, Tom, you said that you’re working on updating
>>     train-model.perl anyway, so probably our modifications could be
>>     something you can consider using.
>>
>>
>>     For more information on what we actually did and a numerical
>>     comparison of the disk usage statistics with and without our
>>     modifications, check Section 2.1 in
>>     Zhechev, Ventsislav. 2012. Machine Translation Infrastructure and
>>     Post-editing Performance at Autodesk. In Proceedings of the AMTA
>>     2012 Workshop on Post-editing Technology and Practice (WPTP ’12),
>>     eds. Sharon O’Brien, Michel Simard and Lucia Specia. San Diego, CA.
>>     (http://amta2012.amtaweb.org/AMTA2012Files/html/5/5_paper.pdf)
>>
>>
>>     Cheers,
>>
>>     Ventzi
>>
>>     -------
>>     Dr. Ventsislav Zhechev
>>     Computational Linguist, Certified ScrumMaster®
>>     Platform Architecture and Technologies
>>     Localisation Services
>>
>>     MAIN +41 32 723 91 22
>>     FAX +41 32 723 93 99
>>
>>     http://VentsislavZhechev.eu
>>
>>     Autodesk, Inc.
>>     Rue de Puits-Godet 6
>>     2000 Neuchâtel, Switzerland
>>     www.autodesk.com
>>
>>
>>> On 27.02.2015 at 12:49, [email protected] wrote:
>>>
>>> Message: 3
>>> Date: Fri, 27 Feb 2015 08:25:01 +0700
>>> From: Tom Hoar <[email protected]>
>>> Subject: Re: [Moses-support] My phrase-table.tgz is 20-bytes long
>>> To: [email protected]
>>> Message-ID: <[email protected]>
>>> Content-Type: text/plain; charset="utf-8"
>>>
>>>
>>> The temp space challenge with train-model.perl is not only how much
>>> temp space is needed, but also where the temp files are created.
>>> train-model.perl seems unpredictable about where temp files are
>>> placed.
>>>
>>> In general, the /tmp folder goes unused unless you specify it with
>>> the "--temp-dir" option. Even when specified, only the sort
>>> functions use it. Without setting the option, the sort temp files go
>>> to the "--model-dir" folder, which by default is
>>> "--root-dir / model". Finally, if not specified, --root-dir defaults
>>> to the current working directory (i.e. cwd or ".").
>>>
>>> That said, the temp folder for extraction step 3 is different. Its
>>> temp files default to a newly-created "tmp" folder under the parent
>>> directory of the path value set with "--extract-file". If
>>> "--extract-file" is not set, the default is "--model-dir / extract",
>>> which also follows the "--model-dir" path defined above.
>>>
>>> Because of this complexity, we advise our customers to use a
>>> hardware configuration that relies on one large root folder
>>> partition... i.e. 1 or 2 TB mounted as "/" root without any other
>>> mounts. This gives Moses one contiguous pool of temp storage
>>> independent of the configuration complexities. I acknowledge this
>>> recommendation violates normal system administrators'
>>> configurations, but without significant changes to
>>> train-model.perl, this is our best recommendation.
>>>
>>>
>>>
>>> On 02/27/2015 04:28 AM, Barry Haddow wrote:
>>>> Hi Alexander
>>>>
>>>> From the error logs, it looks as though alignment went fine, the
>>>> training pipeline reports 24860460 lines of aligned bitext. Since
>>>> the extract files were empty, I'd suggest that extraction crashed,
>>>> and the most likely cause is that it ran out of disk. I'm not sure
>>>> what happened to the error messages.
>>>>
>>>> For 25M sentence pairs, the final phrase table could easily be 30G
>>>> and the intermediate files are larger. You probably need more like
>>>> 500G to be safe.
>>>>
>>>> I would follow Tom's advice and start with a much smaller corpus to
>>>> see how the process works. Also, for the full corpus, you could
>>>> look into fast_align (https://github.com/clab/fast_align) for
>>>> alignment as it is much faster than mgiza (e.g. 2 days versus 2
>>>> weeks), and use EMS for large jobs since it's much easier to
>>>> restart a failed step.
>>>>
>>>> cheers - Barry
>>>>
>>>> On 26/02/15 15:06, ????????? ??????? wrote:
>>>>>
>>>>> Hi Barry!
>>>>>
>>>>> Here you can download training.out
>>>>> https://www.dropbox.com/s/d0f0n99x4wbw3mo/training.out.gz?dl=1
>>>>>
>>>>> I have about 50 Gb of free space in working dir.
>>>>>
>>>>>
>>>>> 2015-02-25 17:19 GMT+07:00 Barry Haddow <[email protected]>:
>>>>>
>>>>> Hi Alexander,
>>>>>
>>>>> It looks like something went wrong at the extract stage. If you
>>>>> could make your training.out available then we can look for
>>>>> clues.
>>>>>
>>>>> Could the system have run out of disk space, either in the
>>>>> working directory or in /tmp? A lot of space is required to
>>>>> build the extract files and phrase tables.
>>>>>
>>>>> cheers - Barry
>>>>>
>>>>
>>>> _______________________________________________
>>>> Moses-support mailing list
>>>> [email protected]
>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
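
PS: the temp-directory defaults Tom describes in the quoted thread are easier to follow written out. The sketch below is based only on his description above, not on reading train-model.perl, so treat it as an assumption about the behaviour; the function name and the dictionary keys are invented for illustration.

```python
from pathlib import Path


def resolve_temp_locations(root_dir=None, model_dir=None,
                           temp_dir=None, extract_file=None):
    """Return where temp files would land, per the defaults described above.

    Hypothetical helper: it mirrors the prose description in the thread,
    not the actual train-model.perl source.
    """
    # --root-dir defaults to the current working directory.
    root = Path(root_dir) if root_dir else Path(".")
    # --model-dir defaults to <root-dir>/model.
    model = Path(model_dir) if model_dir else root / "model"
    # Sort temp files use --temp-dir if given, otherwise the model dir.
    sort_tmp = Path(temp_dir) if temp_dir else model
    # --extract-file defaults to <model-dir>/extract.
    extract = Path(extract_file) if extract_file else model / "extract"
    # Extraction temp files go to a "tmp" folder next to the extract file.
    extract_tmp = extract.parent / "tmp"
    return {"sort": sort_tmp, "extract": extract_tmp}
```

With no options set, both locations end up under ./model, which is why a small working directory fills up even when /tmp has plenty of room.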
