I don't mind. If you can do it on a branch and keep it in sync with master, I can test it once you're done.

On 2 Mar 2015 15:33, "Tom Hoar" <[email protected]> wrote:
> Yes, this is how the "lexical-reordering-score" binary works. We will
> likely update that with our Windows updates (which also apply to Linux).
>
> Also, the consolidate binary's command-line arguments only support a file
> system file as output. The train-model.perl script uses /dev/stdout, which
> works but is not cross-platform friendly. Tests show that many other
> binaries hard-code /dev/stdout. We will update those to standard file
> descriptors. In cases with command-line options, does anyone have
> objections to us updating appropriate binaries to support the standard GNU
> "-" dash to represent file descriptors for stdin/stdout/stderr?
>
>
> On 03/02/2015 09:55 PM, Marcin Junczys-Dowmunt wrote:
>
> There are a few tools that write out gzipped files and they take ages. I
> think the reordering model is first written to an intermediate text file
> (single process) and then compressed from within the binary, again with a
> single process, instead of using a pipe to gzip/pigz. I believe there are a
> couple more examples.
>
> On 02.03.2015 at 15:39, Hieu Hoang wrote:
>
> Probably not by much. The CPU is saturated anyway when you use parallelism.
>
> pigz is used during the sorting. You can even use it to zip the sort's
> intermediate files by putting this in the train-model.perl options:
> -sort-compress pigz
>
>
> On 02/03/2015 13:51, Marcin Junczys-Dowmunt wrote:
>
> When combined with pigz (parallel gzip), the piped approach is probably
> faster. Just saying.
>
> On 02.03.2015 at 14:48, Hieu Hoang wrote:
>
> Thanks Ventsis,
>
> I think most of the improvements in your branch are now in the main
> Moses branch, but done in a different way. For instance, disk I/O is reduced
> in yours by using named pipes to unzip files. In Moses now,
> the zipped files are read directly by the executables, rather than
> being unzipped first.
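To make the two I/O ideas in this part of the thread concrete, here is a minimal Python sketch. It is not Moses code (the binaries discussed are C++), and `open_text` is a hypothetical helper name chosen for illustration. It assumes the usual GNU convention that "-" means stdin when reading and stdout when writing, and it reads or writes ".gz" files directly instead of unzipping to disk first.

```python
import gzip
import sys


def open_text(path, mode="r"):
    """Open `path` for text I/O (hypothetical helper, for illustration only).

    "-" maps to stdin (mode "r") or stdout (mode "w"), so no /dev/stdout
    path is needed. Names ending in ".gz" are (de)compressed on the fly,
    so no uncompressed copy is written to disk.
    """
    if mode not in ("r", "w"):
        raise ValueError("mode must be 'r' or 'w'")
    if path == "-":
        return sys.stdin if mode == "r" else sys.stdout
    if path.endswith(".gz"):
        return gzip.open(path, mode + "t", encoding="utf-8")
    return open(path, mode, encoding="utf-8")
```

One caveat with this design: a caller that wraps the result in a `with` block will close sys.stdout when "-" was passed, so a real tool would probably want to special-case that.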
>
> There's also support for parallelism in most stages of training.
>
> If deadyaga is running out of space during extract, it's likely he'll run
> out of space again at a later point, even with all the space optimisation
> turned on. The surest way is to run it on a bigger disk.
>
> Hieu Hoang
> Research Associate (until March 2015)
> ** searching for interesting commercial MT position **
> University of Edinburgh
> http://www.hoang.co.uk/hieu
>
>
> On 2 March 2015 at 07:45, "Венцислав Жечев (Ventsislav Zhechev)" <
> [email protected]> wrote:
>
>> Hi all,
>> I’m chiming in a bit late on this conversation, but thought what I have
>> could help.
>>
>> At Autodesk we modified the training process a few years back to avoid
>> the necessity of using so much temp disk space and to reduce the overall
>> disk I/O during training, all on the premise that in a stable environment
>> we don’t care about the intermediary files.
>>
>> Our code is available on GitHub at
>> https://github.com/venyz/ADSKMosesTraining
>> Have a look specifically at
>> https://github.com/venyz/ADSKMosesTraining/blob/master/scripts/training/train-model.perl
>> around lines 500-600, where we launch a bunch of processing steps in
>> parallel, setting up the data transfer between tools via Unix named pipes.
>> For this to work properly, we also had to modify some of the other training
>> workflow tools, but never had the time and resources to reintegrate our
>> changes in the Moses master (at the time Moses wasn’t even on Git yet).
>> Though I doubt you can use this directly with the current Moses training
>> tools, I hope this gives you inspiration and helps solve Alexander’s disk
>> space issue.
>>
>> Also, Tom, you said that you’re working on updating train-model.perl
>> anyway, so our modifications could be something you can consider
>> using.
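For readers unfamiliar with the named-pipe approach described above, here is a minimal Python sketch. It is not taken from ADSKMosesTraining; the function name and the FIFO file name are hypothetical, and it is Unix-only because it relies on os.mkfifo. A background thread decompresses a gzipped file into a FIFO while the consumer, standing in for a training tool that expects plain text, reads from the FIFO. The uncompressed data never lands on disk.

```python
import gzip
import os
import shutil
import tempfile
import threading


def count_lines_via_fifo(gz_path):
    """Stream gz_path through a named pipe and count the lines read from it."""
    workdir = tempfile.mkdtemp()
    fifo_path = os.path.join(workdir, "extract.fifo")  # hypothetical name
    os.mkfifo(fifo_path)  # Unix only

    def producer():
        # Opening a FIFO for writing blocks until a reader opens the other end.
        # The FIFO is opened first so the reader still sees EOF if gzip fails.
        with open(fifo_path, "wb") as dst, gzip.open(gz_path, "rb") as src:
            shutil.copyfileobj(src, dst)

    writer = threading.Thread(target=producer, daemon=True)
    writer.start()
    try:
        # The consumer only ever sees a plain-text stream.
        with open(fifo_path, "rb") as stream:
            n = sum(1 for _ in stream)
        writer.join()
    finally:
        shutil.rmtree(workdir)
    return n
```

In a real pipeline the producer and consumer would be separate processes (for example a decompressor and an extraction tool), but the rendezvous on the FIFO works the same way.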
>>
>>
>> For more information on what we actually did and a numerical comparison
>> of the disk usage statistics with and without our modifications, check
>> Section 2.1 in
>> Zhechev, Ventsislav. 2012. Machine Translation Infrastructure and
>> Post-editing Performance at Autodesk. In Proceedings of the AMTA
>> 2012 Workshop on Post-editing Technology and Practice (WPTP ’12), eds.
>> Sharon O’Brien, Michel Simard and Lucia Specia. San Diego, CA.
>> (http://amta2012.amtaweb.org/AMTA2012Files/html/5/5_paper.pdf)
>>
>>
>> Cheers,
>>
>> Ventzi
>>
>> -------
>> Dr. Ventsislav Zhechev
>> Computational Linguist, Certified ScrumMaster®
>> Platform Architecture and Technologies
>> Localisation Services
>>
>> MAIN +41 32 723 91 22
>> FAX +41 32 723 93 99
>>
>> http://VentsislavZhechev.eu
>>
>> Autodesk, Inc.
>> Rue de Puits-Godet 6
>> 2000 Neuchâtel, Switzerland
>> www.autodesk.com
>>
>>
>> On 27.02.2015 at 12:49, [email protected] wrote:
>>
>> Message: 3
>> Date: Fri, 27 Feb 2015 08:25:01 +0700
>> From: Tom Hoar <[email protected]>
>> Subject: Re: [Moses-support] My phrase-table.tgz is 20-bytes long
>> To: [email protected]
>> Message-ID: <[email protected]>
>> Content-Type: text/plain; charset="utf-8"
>>
>>
>> The temp space challenge with train-model.perl is not only how much temp
>> space is needed, but also where the temp files are created.
>> train-model.perl seems unpredictable about where temp files are placed.
>>
>> In general, the /tmp folder goes unused unless you specify it with the
>> "--temp-dir" option. Even when specified, only the sort functions use
>> it. Without setting the option, the sort temp files go to
>> the "--model-dir" folder, which by default is "--root-dir / model".
>> Finally, if not specified, --root-dir defaults to the current working
>> directory (i.e. cwd or ".").
>>
>> That said, the temp folder for extraction step 3 is different.
>> Its temp files default to a newly-created "tmp" folder under the parent
>> directory of the path value set with "--extract-file". If "--extract-file"
>> is not set, the default is "--model-dir / extract", which also follows the
>> "--model-dir" path defined above.
>>
>> Because of this complexity, we advise our customers to use a hardware
>> configuration that relies on one large root folder partition, i.e. 1
>> or 2 TB mounted as "/" root without any other mounts. This gives Moses
>> one contiguous pool of temp storage independent of the configuration
>> complexities. I acknowledge this recommendation violates normal system
>> administrators' configurations, but without significant changes to
>> train-model.perl, this is our best recommendation.
>>
>>
>> On 02/27/2015 04:28 AM, Barry Haddow wrote:
>>
>> Hi Alexander
>>
>> From the error logs, it looks as though alignment went fine; the
>> training pipeline reports 24860460 lines of aligned bitext. Since the
>> extract files were empty, I'd suggest that extraction crashed, and the
>> most likely cause is that it ran out of disk. I'm not sure what happened
>> to the error messages.
>>
>> For 25M sentence pairs, the final phrase table could easily be 30G and
>> the intermediate files are larger. You probably need more like 500G to
>> be safe.
>>
>> I would follow Tom's advice and start with a much smaller corpus to
>> see how the process works. Also, for the full corpus, you could look
>> into fast_align (https://github.com/clab/fast_align) for alignment as
>> it is much faster than mgiza (e.g. 2 days versus 2 weeks), and use EMS
>> for large jobs since it's much easier to restart a failed step.
>>
>> cheers - Barry
>>
>> On 26/02/15 15:06, ????????? ??????? wrote:
>>
>>
>> Hi Barry!
>>
>> Here you can download training.out
>> https://www.dropbox.com/s/d0f0n99x4wbw3mo/training.out.gz?dl=1
>>
>> I have about 50 GB of free space in the working dir.
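The temp-directory rules Tom describes, together with the disk-space question, can be sketched in Python as follows. This is a minimal sketch based only on the description in this thread, not on reading train-model.perl itself; `temp_locations` and `free_gib` are hypothetical helper names, and the keyword arguments simply mirror the --root-dir, --model-dir, --temp-dir and --extract-file options mentioned above.

```python
import os
import shutil


def temp_locations(root_dir=None, model_dir=None, temp_dir=None, extract_file=None):
    """Resolve temp locations as described in the thread (an assumption,
    not verified against train-model.perl)."""
    root_dir = root_dir or os.getcwd()                    # --root-dir defaults to cwd
    model_dir = model_dir or os.path.join(root_dir, "model")
    extract_file = extract_file or os.path.join(model_dir, "extract")
    return {
        # Sort temp files use --temp-dir if given, else the model dir.
        "sort": temp_dir or model_dir,
        # Extraction uses a "tmp" folder next to the extract file.
        "extract": os.path.join(os.path.dirname(extract_file), "tmp"),
    }


def free_gib(path):
    """Free space in GiB on the filesystem that would hold `path`."""
    path = os.path.abspath(path)
    # The directory may not exist yet, so walk up to the nearest ancestor.
    while not os.path.exists(path):
        path = os.path.dirname(path)
    return shutil.disk_usage(path).free / 2**30
```

Calling free_gib on each value returned by temp_locations before a large run gives a quick picture of whether the relevant partitions are anywhere near the several hundred gigabytes suggested above for a 25M-pair corpus.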
>>
>>
>> 2015-02-25 17:19 GMT+07:00 Barry Haddow <[email protected]>:
>>
>> Hi Alexander,
>>
>> It looks like something went wrong at the extract stage. If you
>> could make your training.out available then we can look for clues.
>>
>> Could the system have run out of disk space, either in the
>> working directory or in /tmp? A lot of space is required to build
>> the extract files and phrase tables.
>>
>> cheers - Barry
>>
>> _______________________________________________
>> Moses-support mailing list
>> [email protected]
>> http://mailman.mit.edu/mailman/listinfo/moses-support
