Re: [Moses-support] My phrase-table.tgz is 20-bytes long

Венцислав Жечев (Ventsislav Zhechev) Mon, 02 Mar 2015 07:20:32 -0800

One of the improvements we have in there takes care of running steps 5 to 7 
largely in parallel, which can help speed things up if you have the CPUs 
available (and you probably do).
Specifically,
- while doing phrase extraction, you’ll have three sort processes running at 
the same time (so four processes in total, counting the extractor), working in 
parallel on the three extraction outputs (extract, extract.inv and extract.o)
- as soon as the sorting is done, the scoring processes will start working, 
feeding further processing; so you’ll have the two phrase table half scorings, 
the reordering scoring, the sorting of the inv phrase table half and—if 
desired—the binarisation of the reordering table going at the same time
- finally—once the sorting of the inv phrase table half is done—you can have 
the phrase table consolidation and binarisation working at the same time in a 
pipeline.
The only temp disk space you’ll need during this process is for the sorting 
and—if you have the RAM—you can control this to some extent.
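
To make the sorting part concrete, here is a minimal Python sketch of running one sort process per extraction output at the same time. It is not taken from our scripts; it assumes GNU sort, whose -S, -T and --compress-program options set the in-memory buffer, the temp directory and the compressor for temp files. The function names, the ".sorted" suffix, the LC_ALL=C setting and plain-text (not gzipped) inputs are all assumptions made for illustration.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import List, Optional


def build_sort_cmd(src: str, dst: str, buffer_size: str = "2G",
                   temp_dir: Optional[str] = None,
                   compress_program: Optional[str] = None) -> List[str]:
    """Build a GNU sort command line (hypothetical helper)."""
    # A larger -S buffer means fewer temp files spilled to disk.
    cmd = ["sort", "-S", buffer_size, "-o", dst]
    if temp_dir:
        cmd += ["-T", temp_dir]  # where sort writes its temp files
    if compress_program:
        # e.g. "pigz" or "gzip": compresses sort's temp files
        cmd.append(f"--compress-program={compress_program}")
    cmd.append(src)
    return cmd


def sort_in_parallel(files: List[str], **sort_opts) -> List[str]:
    """Sort each file with its own sort process, all running concurrently."""
    # Assumption: byte-wise ordering, so results do not depend on the locale.
    env = {**os.environ, "LC_ALL": "C"}

    def run(src: str) -> str:
        dst = src + ".sorted"
        subprocess.run(build_sort_cmd(src, dst, **sort_opts),
                       check=True, env=env)
        return dst

    # One worker thread per file; each thread just waits on its sort process.
    with ThreadPoolExecutor(max_workers=max(1, len(files))) as pool:
        return list(pool.map(run, files))
```

A call such as sort_in_parallel(["extract", "extract.inv", "extract.o"], buffer_size="8G", temp_dir="/big/tmp", compress_program="pigz") would then start the three sorts together; the paths and sizes here are made up.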


Compared to the GIZA runtime this might not give you much advantage overall, 
but why leave those CPUs idling, right?


Cheers,

Ventzi



> On 2.03.2015, at 15:55, [email protected] wrote:
> 
> Message: 1
> Date: Mon, 02 Mar 2015 15:55:48 +0100
> From: Marcin Junczys-Dowmunt <[email protected]>
> Subject: Re: [Moses-support] My phrase-table.tgz is 20-bytes long
> To: Hieu Hoang <[email protected]>, [email protected]
> Message-ID: <[email protected]>
> Content-Type: text/plain; charset="utf-8"
> 
> There are a few tools that write out gzipped files, and they take ages. I 
> think the reordering model is first written to an intermediate text file 
> (single process) and then compressed from within the binary, again with a 
> single process, instead of using a pipe to gzip/pigz. I believe there are 
> a couple more examples.
> 
> On 02.03.2015 at 15:39, Hieu Hoang wrote:
>> Probably not by much. The CPU is saturated anyway when you use 
>> parallelism.
>> 
>> pigz is used during the sorting. You can even use it to zip the sort's 
>> intermediate files by putting this in the train-model.perl options:
>>   -sort-compress pigz
>> 
>> 
>> On 02/03/2015 13:51, Marcin Junczys-Dowmunt wrote:
>>> When combined with pigz (parallel gzip), the piped approach is 
>>> probably faster. Just saying.
>>> 
>>> On 02.03.2015 at 14:48, Hieu Hoang wrote:
>>>> thanks ventsis,
>>>> 
>>>> I think most of the improvements in your branch are now in the main 
>>>> Moses branch, but done in a different way. For instance, yours 
>>>> reduces disk IO by using named pipes to unzip files. In Moses now, 
>>>> the zipped files are read directly by the executables, rather than 
>>>> being unzipped first.
>>>> 
>>>> There's also support for parallelism in most stages of training
>>>> 
>>>> If deadyaga is running out of space during extract, it's likely he'll 
>>>> run out of space again at a later point, even with all the space 
>>>> optimisation turned on. The surest way is to run it on a bigger disk.
>>>> 
>>>> Hieu Hoang
>>>> Research Associate (until March 2015)
>>>> ** searching for interesting commercial MT position **
>>>> University of Edinburgh
>>>> http://www.hoang.co.uk/hieu
>>>> 
>>>> 
>>>> On 2 March 2015 at 07:45, "Венцислав Жечев (Ventsislav Zhechev)" 
>>>> <[email protected]> wrote:
>>>> 
>>>>    Hi all,
>>>>    I’m chiming in a bit late on this conversation, but thought what
>>>>    I have could help.
>>>> 
>>>>    At Autodesk we modified the training process a few years back to
>>>>    avoid the necessity of using so much temp disk space and to
>>>>    reduce the overall disk I/O during training, all with the
>>>>    premise that in a stable environment we don’t care about the
>>>>    intermediary files.
>>>> 
>>>>    Our code is available on GitHub at
>>>>    https://github.com/venyz/ADSKMosesTraining
>>>>    Have a look specifically at
>>>>    
>>>> https://github.com/venyz/ADSKMosesTraining/blob/master/scripts/training/train-model.perl
>>>>    around lines 500-600, where we launch a bunch of processing
>>>>    steps in parallel, setting up the data transfer between tools
>>>>    via unix named pipes. For this to work properly, we also had to
>>>>    modify some of the other training workflow tools, but never had
>>>>    the time and resources to reintegrate our changes in the Moses
>>>>    master (at the time Moses wasn’t even on Git yet).
>>>>    Though I doubt you can use this directly with the current moses
>>>>    training tools, I hope this gives you inspiration and helps
>>>>    solve Alexander’s disk space issue.
>>>> 
>>>>    Also, Tom, you said that you’re working on updating
>>>>    train-model.perl anyway, so probably our modifications could be
>>>>    something you can consider using.
>>>> 
>>>> 
>>>>    For more information on what we actually did and a numerical
>>>>    comparison of the disk usage statistics with and without our
>>>>    modifications, check Section 2.1 in
>>>>    Zhechev, Ventsislav. 2012. Machine Translation Infrastructure
>>>>    and Post-editing Performance at Autodesk. In Proceedings of the
>>>>    AMTA 2012 Workshop on Post-editing Technology and Practice (WPTP
>>>>    ’12), eds. Sharon O’Brien, Michel Simard and Lucia Specia. San
>>>>    Diego, CA.
>>>>    (http://amta2012.amtaweb.org/AMTA2012Files/html/5/5_paper.pdf)
>>>> 
>>>> 
>>>>    Cheers,
>>>> 
>>>>    Ventzi
>>>> 
>>>>    Dr. Ventsislav Zhechev
>>>>    Computational Linguist, Certified ScrumMaster®
>>>>    Platform Architecture and Technologies
>>>>    Localisation Services
>>>> 
>>>>    MAIN +41 32 723 91 22
>>>>    FAX +41 32 723 93 99
>>>> 
>>>>    http://VentsislavZhechev.eu
>>>> 
>>>>    Autodesk, Inc.
>>>>    Rue de Puits-Godet 6
>>>>    2000 Neuchâtel, Switzerland
>>>>    www.autodesk.com <http://www.autodesk.com>
>>>> 
>>>> 
>>>> 
>>>> 
>>>>>    On 27.02.2015, at 12:49, [email protected] wrote:
>>>>> 
>>>>>    Message: 3
>>>>>    Date: Fri, 27 Feb 2015 08:25:01 +0700
>>>>>    From: Tom Hoar <[email protected]>
>>>>>    Subject: Re: [Moses-support] My phrase-table.tgz is 20-bytes long
>>>>>    To: [email protected]
>>>>>    Message-ID: <[email protected]>
>>>>>    Content-Type: text/plain; charset="utf-8"
>>>>> 
>>>>> 
>>>>>    The temp space challenge with train-model.perl is not only how
>>>>>    much temp space is needed, but also where the temp files are
>>>>>    created. train-model.perl seems unpredictable about where temp
>>>>>    files are placed.
>>>>> 
>>>>>    In general, the /tmp folder goes unused unless you specify it
>>>>>    with the "--temp-dir" option. Even when it is specified, only
>>>>>    the sort functions use it. Without the option, the sort temp
>>>>>    files go to the "--model-dir" folder, which by default is
>>>>>    "--root-dir/model". Finally, if not specified, "--root-dir"
>>>>>    defaults to the current working directory (i.e. cwd or ".").
>>>>> 
>>>>>    That said, the temp folder for the extraction step (step 5 of
>>>>>    train-model.perl) is different. Its temp files default to a
>>>>>    newly created "tmp" folder under the parent directory of the
>>>>>    path set with "--extract-file". If "--extract-file" is not
>>>>>    set, the default is "--model-dir/extract", which in turn
>>>>>    follows the "--model-dir" path defined above.
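
The fallback chain Tom describes above can be written down as a short Python sketch. This only mirrors the description in his message, not the actual train-model.perl code; the function name and the dictionary keys are hypothetical, and the parameters simply stand in for the "--root-dir", "--model-dir", "--temp-dir" and "--extract-file" options.

```python
from pathlib import Path
from typing import Dict, Optional


def resolve_training_dirs(root_dir: Optional[str] = None,
                          model_dir: Optional[str] = None,
                          temp_dir: Optional[str] = None,
                          extract_file: Optional[str] = None) -> Dict[str, Path]:
    """Resolve temp locations following the defaults described in the thread."""
    # --root-dir falls back to the current working directory.
    root = Path(root_dir) if root_dir else Path.cwd()
    # --model-dir falls back to <root-dir>/model.
    model = Path(model_dir) if model_dir else root / "model"
    # Sort temp files use --temp-dir if given, else the model directory.
    sort_tmp = Path(temp_dir) if temp_dir else model
    # --extract-file falls back to <model-dir>/extract.
    extract = Path(extract_file) if extract_file else model / "extract"
    # Extraction temp files go to "tmp" next to the extract file.
    extract_tmp = extract.parent / "tmp"
    return {"root": root, "model": model, "sort_tmp": sort_tmp,
            "extract": extract, "extract_tmp": extract_tmp}
```

With only root_dir="/data/run1" set (a made-up path), both the sort temp files and the extraction "tmp" folder end up under /data/run1/model, so that is the partition that has to hold them.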
>>>>> 
>>>>>    Because of this complexity, we advise our customers to use a
>>>>>    hardware configuration that relies on one large root folder
>>>>>    partition, i.e. 1 or 2 TB mounted as "/" root without any
>>>>>    other mounts. This gives Moses one contiguous pool of temp
>>>>>    storage independent of the configuration complexities. I
>>>>>    acknowledge this recommendation violates normal system
>>>>>    administrators' configurations, but without significant
>>>>>    changes to train-model.perl, this is our best recommendation.
>>>>> 
>>>>> 
>>>>> 
>>>>>    On 02/27/2015 04:28 AM, Barry Haddow wrote:
>>>>>>    Hi Alexander
>>>>>> 
>>>>>>    From the error logs, it looks as though alignment went fine:
>>>>>>    the training pipeline reports 24860460 lines of aligned
>>>>>>    bitext. Since the extract files were empty, I'd suggest that
>>>>>>    extraction crashed, and the most likely cause is that it ran
>>>>>>    out of disk space. I'm not sure what happened to the error
>>>>>>    messages.
>>>>>> 
>>>>>>    For 25M sentence pairs, the final phrase table could easily
>>>>>>    be 30G, and the intermediate files are larger. You probably
>>>>>>    need more like 500G to be safe.
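
A quick pre-flight check along the lines of Barry's estimate could look like the Python sketch below. The function names are hypothetical, and reading "500G" as 500 GiB of free space on the partition holding the working directory is an assumption, not a measured requirement.

```python
import shutil

GIB = 1024 ** 3  # bytes per GiB


def free_gib(path: str = ".") -> float:
    """Free space, in GiB, on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / GIB


def has_enough_space(path: str = ".", required_gib: float = 500.0) -> bool:
    """True if the filesystem holding `path` has at least `required_gib` free."""
    return free_gib(path) >= required_gib
```

Running has_enough_space() in the working directory before starting extraction would have flagged a disk with only about 50 GB free.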
>>>>>> 
>>>>>>    I would follow Tom's advice and start with a much smaller
>>>>>>    corpus to see how the process works. Also, for the full
>>>>>>    corpus, you could look into fast_align
>>>>>>    (https://github.com/clab/fast_align) for alignment, as it is
>>>>>>    much faster than mgiza (e.g. 2 days versus 2 weeks), and use
>>>>>>    EMS for large jobs, since it makes it much easier to restart
>>>>>>    a failed step.
>>>>>> 
>>>>>>    cheers - Barry
>>>>>> 
>>>>>>    On 26/02/15 15:06, ????????? ??????? wrote:
>>>>>>> 
>>>>>>>    Hi Barry!
>>>>>>> 
>>>>>>>    Here you can download training.out
>>>>>>>    https://www.dropbox.com/s/d0f0n99x4wbw3mo/training.out.gz?dl=1
>>>>>>> 
>>>>>>>    I have about 50 GB of free space in the working dir.
>>>>>>> 
>>>>>>> 
>>>>>>>    2015-02-25 17:19 GMT+07:00 Barry Haddow <[email protected]>:
>>>>>>> 
>>>>>>>       Hi Alexander,
>>>>>>> 
>>>>>>>       It looks like something went wrong at the extract stage.
>>>>>>>       If you could make your training.out available then we can
>>>>>>>       look for clues.
>>>>>>> 
>>>>>>>       Could the system have run out of disk space, either in the
>>>>>>>       working directory or in /tmp? A lot of space is required
>>>>>>>       to build the extract files and phrase tables.
>>>>>>> 
>>>>>>>       cheers - Barry
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>    _______________________________________________
>>>>>>    Moses-support mailing list
>>>>>>    [email protected] <mailto:[email protected]>
>>>>>>    http://mailman.mit.edu/mailman/listinfo/moses-support
>>>>> 
>>>> 
>>> 
>> 
> 


_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
