Re: [Moses-support] My phrase-table.tgz is 20-bytes long

Tom Hoar Mon, 02 Mar 2015 07:34:33 -0800

Yes, this is how the "lexical-reordering-score" binary works. We will likely update that along with our Windows updates (the changes also apply to Linux).

Also, the consolidate binary's command-line arguments only support a file system file as output. The train-model.perl script uses /dev/stdout, which works but is not cross-platform friendly. Tests show that many other binaries hard-code /dev/stdout. We will update those to standard file descriptors. In cases with command-line options, does anyone have objections to us updating the appropriate binaries to support the standard GNU "-" (dash) to represent the file descriptors for stdin/stdout/stderr?
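
For anyone unfamiliar with the "-" convention, here is a minimal Python sketch of the idea. It is an illustration only, not code from any Moses binary, and the helper name open_output is made up for this example.

```python
import sys
from contextlib import contextmanager


@contextmanager
def open_output(path):
    """Yield a writable text stream; the path "-" means standard output."""
    if path == "-":
        # Use the process's own stdout rather than opening /dev/stdout,
        # which is not available on every platform.
        yield sys.stdout
        sys.stdout.flush()
    else:
        # Any other value is treated as an ordinary file system path.
        with open(path, "w", encoding="utf-8") as handle:
            yield handle
```

A caller would write `with open_output(arg) as out: out.write(...)` and never needs to know whether it is writing to a file or to stdout.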



On 03/02/2015 09:55 PM, Marcin Junczys-Dowmunt wrote:

There are a few tools that write out gzipped files and they take ages. I think the reordering model is first written to an intermediate text file (single process) and then compressed from within the binary, again with a single process, instead of using a pipe to gzip/pigz. I believe there are a couple more examples.
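
To make the "pipe to gzip/pigz" alternative concrete, here is a rough Python sketch. It is not how any Moses tool is implemented; the function name write_compressed is hypothetical, and it assumes that pigz or gzip, if present, accepts "-c" to write the compressed stream to stdout. The lines are streamed straight into the compressor process, so no intermediate text file is written.

```python
import gzip
import shutil
import subprocess


def write_compressed(lines, out_path):
    """Write text lines to out_path in gzip format and return the tool used.

    Prefers an external pigz or gzip process fed through a pipe, so the
    compression runs outside the producing process. Falls back to Python's
    gzip module when neither program is installed.
    """
    compressor = shutil.which("pigz") or shutil.which("gzip")
    if compressor is None:
        with gzip.open(out_path, "wt", encoding="utf-8") as handle:
            handle.writelines(lines)
        return "python-gzip"
    with open(out_path, "wb") as raw:
        # The compressor reads from our pipe and writes to the output file.
        proc = subprocess.Popen(
            [compressor, "-c"], stdin=subprocess.PIPE, stdout=raw
        )
        for line in lines:
            proc.stdin.write(line.encode("utf-8"))
        proc.stdin.close()
        if proc.wait() != 0:
            raise RuntimeError(f"{compressor} exited with an error")
    return compressor
```

Whether this is actually faster than compressing inside the binary depends on the machine and on how busy the CPU already is.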


On 02.03.2015 at 15:39, Hieu Hoang wrote:

probably not by much. The cpu is saturated anyway when you use parallelism.

pigz is used during the sorting. you can even use it to zip the sort's intermediate files by putting this in the train-model.perl options:

   -sort-compress pigz


On 02/03/2015 13:51, Marcin Junczys-Dowmunt wrote:

When combined with pigz (parallel gzip), the piped approach is probably faster. Just saying.


On 02.03.2015 at 14:48, Hieu Hoang wrote:

thanks ventsis,

i think most of the improvements in your branch are now in the main Moses branch, but done in a different way. For instance, disk IO is reduced in yours by using named pipes to unzip files. In Moses now, the zipped files are read directly by the executables, rather than being unzipped first.


There's also support for parallelism in most stages of training

If deadyaga is running out of space during extract, it's likely he'll run out of space again at a later point, even with all the space optimisation turned on. The surest way is to run it on a bigger disk.


Hieu Hoang
Research Associate (until March 2015)
** searching for interesting commercial MT position **
University of Edinburgh
http://www.hoang.co.uk/hieu

On 2 March 2015 at 07:45, "Венцислав Жечев (Ventsislav Zhechev)"<[email protected]<mailto:[email protected]>> wrote:


    Hi all,
    I’m chiming in a bit late on this conversation, but thought
    what I have could help.

    At Autodesk we modified the training process a few years back
    to avoid the necessity of using so much temp disk space and to
    reduce the overall disk I/O during training—all with the
    premise that in a stable environment we don’t care about the
    intermediary files.

    Our code is available on GitHub at
    https://github.com/venyz/ADSKMosesTraining
    Have a look specifically at
    https://github.com/venyz/ADSKMosesTraining/blob/master/scripts/training/train-model.perl
    around lines 500–600, where we launch a bunch of processing
    steps in parallel setting up the data transfer between tools
    via unix named pipes. For this to work properly, we also had to
    modify some of the other training workflow tools, but never had
    the time and resources to reintegrate our changes in the Moses
    master (at the time Moses wasn’t even on Git yet).
    Though I doubt you can use this directly with the current moses
    training tools, I hope this gives you inspiration and helps
    solve Alexander’s disk space issue.

    Also, Tom you said that you’re working on updating
    train-model.perl anyway, so probably our modifications could be
    something you can consider using.


    For more information on what we actually did and a numerical
    comparison of the disk usage statistics with and without our
    modifications, check Section 2.1 in
    Zhechev, Ventsislav. 2012. Machine Translation Infrastructure
    and Post-editing Performance at Autodesk. In Proceedings of the
    AMTA 2012 Workshop on Post-editing Technology and Practice
    (WPTP ’12), eds. Sharon O’Brien, Michel Simard and Lucia
    Specia. San Diego, CA.
    (http://amta2012.amtaweb.org/AMTA2012Files/html/5/5_paper.pdf)


    Cheers,

    Ventzi

    –––––––
    Dr. Ventsislav Zhechev
    Computational Linguist, Certified ScrumMaster®
    Platform Architecture and Technologies
    Localisation Services

    MAIN +41 32 723 91 22
    FAX +41 32 723 93 99

    http://VentsislavZhechev.eu

    Autodesk, Inc.
    Rue de Puits-Godet 6
    2000 Neuchâtel, Switzerland
    www.autodesk.com

    On 27.02.2015 at 12:49, [email protected]
    <mailto:[email protected]> wrote:

    Message: 3
    Date: Fri, 27 Feb 2015 08:25:01 +0700
    From: Tom Hoar <[email protected]
    <mailto:[email protected]>>
    Subject: Re: [Moses-support] My phrase-table.tgz is 20-bytes long
    To: [email protected] <mailto:[email protected]>
    Message-ID: <[email protected]
    <mailto:[email protected]>>
    Content-Type: text/plain; charset="utf-8"


    The temp space challenge with train-model.perl is not only how much
    temp space is needed, but also where the temp files are created.
    train-model.perl seems unpredictable about where temp files are placed.

    In general, the /tmp folder goes unused unless you specify it with the
    "--temp-dir" option. Even when specified, only the sort functions use
    it. Without setting the option, the sort temp files go to the
    "--model-dir" folder, which by default is "--root-dir / model".
    Finally, if not specified, --root-dir defaults to the current working
    directory (i.e. cwd or ".").

    That said, the temp folder for extraction step 3 is different. Its temp
    files default to a newly-created "tmp" folder under the parent
    directory of the path value set with "--extract-file". If
    "--extract-file" is not set, the default is "--model-dir / extract",
    which also follows the "--model-dir" path defined above.

    Because of this complexity, we advise our customers to use a hardware
    configuration that relies on one large root folder partition, i.e. 1
    or 2 TB mounted as "/" root without any other mounts. This gives Moses
    one contiguous pool of temp storage independent of the configuration
    complexities. I acknowledge this recommendation goes against normal
    system administrators' configurations, but without significant changes
    to train-model.perl, this is our best recommendation.
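
The defaulting rules Tom describes above can be written down as a small Python sketch. This only paraphrases the description in this message; it is not taken from train-model.perl, and the function name resolve_dirs and its keyword arguments are invented here to mirror the "--root-dir", "--model-dir", "--temp-dir" and "--extract-file" options.

```python
import os


def resolve_dirs(root_dir=None, model_dir=None, temp_dir=None, extract_file=None):
    """Return where sort and extract temp files would land, per the rules above."""
    # --root-dir falls back to the current working directory.
    root = root_dir if root_dir is not None else os.getcwd()
    # --model-dir falls back to <root-dir>/model.
    model = model_dir if model_dir is not None else os.path.join(root, "model")
    # Sort temp files use --temp-dir if given, otherwise the model directory.
    sort_tmp = temp_dir if temp_dir is not None else model
    # --extract-file falls back to <model-dir>/extract.
    extract = extract_file if extract_file is not None else os.path.join(model, "extract")
    # Extract temp files go to a "tmp" folder next to the extract file.
    extract_tmp = os.path.join(os.path.dirname(extract), "tmp")
    return {"sort_tmp": sort_tmp, "extract_tmp": extract_tmp}
```

With only a root directory of /work set, this gives /work/model for the sort temp files and /work/model/tmp for the extract temp files, which shows why both end up on the same partition as the model unless every option is set explicitly.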



    On 02/27/2015 04:28 AM, Barry Haddow wrote:

    Hi Alexander

    From the error logs, it looks as though alignment went fine; the
    training pipeline reports 24860460 lines of aligned bitext. Since the
    extract files were empty, I'd suggest that extraction crashed, and the
    most likely cause is that it ran out of disk space. I'm not sure what
    happened to the error messages.

    For 25M sentence pairs, the final phrase table could easily be 30G and
    the intermediate files are larger. You probably need more like 500G to
    be safe.
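
A quick pre-flight check before a long training run can catch this early. The sketch below is an illustration in Python, not part of Moses; the helper name has_free_space is made up, and the 500G figure is only Barry's rough estimate from the paragraph above.

```python
import shutil

GIB = 1024 ** 3  # bytes in one gibibyte


def has_free_space(path, required_bytes):
    """Return True if the filesystem holding path has at least required_bytes free."""
    return shutil.disk_usage(path).free >= required_bytes
```

For example, `has_free_space(".", 500 * GIB)` checks the partition holding the working directory. Any separate temp or extract directory would need the same check if it lives on a different partition.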

    I would follow Tom's advice and start with a much smaller corpus to
    see how the process works. Also, for the full corpus, you could look
    into fast_align (https://github.com/clab/fast_align) for alignment, as
    it is much faster than mgiza (e.g. 2 days versus 2 weeks), and use EMS
    for large jobs since it's much easier to restart a failed step.

    cheers - Barry

    On 26/02/15 15:06, ????????? ??????? wrote:


    Hi Barry!

    Here you can download training.out
    https://www.dropbox.com/s/d0f0n99x4wbw3mo/training.out.gz?dl=1

    I have about 50 Gb of free space in working dir.


    2015-02-25 17:19 GMT+07:00 Barry Haddow <[email protected]>:

       Hi Alexander,

       It looks like something went wrong at the extract stage. If you
       could make your training.out available then we can look for clues.

       Could the system have run out of disk space, either in the
       working directory or in /tmp? A lot of space is required to build
       the extract files and phrase tables.

       cheers - Barry




    _______________________________________________
    Moses-support mailing list
    [email protected] <mailto:[email protected]>
    http://mailman.mit.edu/mailman/listinfo/moses-support

