Re: [Moses-support] Faster decoding with multiple moses instances

Michael Denkowski Thu, 08 Oct 2015 12:07:47 -0700

Hi all,

I extended the multi_moses.py script to support multi-threaded moses
instances for cases where memory limits the number of decoders that can run
in parallel.  The threads arg now takes the form "--threads P:T:E" to run P
processes using T threads each and an optional extra process running E
threads.  The script sends input lines to instances as they have free
threads so all CPUs stay busy for the full decoding run.
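
As an illustration only (this is not the code in multi_moses.py), the "P:T:E" form and the "send each line to an instance with free threads" idea could be sketched roughly as below. The function names parse_threads_spec and pick_instance are hypothetical, and the "most free threads first" policy is an assumption rather than the script's documented behaviour.

```python
def parse_threads_spec(spec):
    """Expand 'P:T' or 'P:T:E' into a list of thread counts, one per instance.

    Hypothetical helper: assumes every field is a positive integer.
    """
    parts = [int(x) for x in spec.split(":")]
    if len(parts) not in (2, 3) or any(n < 1 for n in parts):
        raise ValueError("expected P:T or P:T:E with positive integers: %r" % spec)
    procs, threads = parts[0], parts[1]
    counts = [threads] * procs          # P instances with T threads each
    if len(parts) == 3:
        counts.append(parts[2])         # optional extra instance with E threads
    return counts


def pick_instance(free_threads):
    """Return the index of the instance with the most free threads, or None.

    free_threads[i] is the number of idle threads in instance i. None means
    every instance is busy and the caller should wait before sending a line.
    """
    if not free_threads:
        return None
    best = max(range(len(free_threads)), key=lambda i: free_threads[i])
    return best if free_threads[best] > 0 else None
```

With this sketch, parse_threads_spec("2:11:10") gives [11, 11, 10], and a dispatcher would decrement the chosen instance's free count when it sends a line and increment it again when the translation comes back.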


I ran some more benchmarks with the CompactPT system, trading off between
threads and processes:

procs x threads    sent/sec
1x16                   5.46
2x8                    7.58
4x4                    9.71
8x2                   12.50
16x1                  14.08
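
To make the trade-off easier to read, the throughput figures can be expressed relative to the single 16-thread process. The snippet below is just that arithmetic, using the numbers from the table above; the helper name speedups is made up for the example.

```python
# Throughput figures copied from the table above (sentences per second).
results = {"1x16": 5.46, "2x8": 7.58, "4x4": 9.71, "8x2": 12.50, "16x1": 14.08}


def speedups(results, baseline="1x16"):
    """Return each configuration's throughput relative to the baseline."""
    base = results[baseline]
    return {cfg: round(rate / base, 2) for cfg, rate in results.items()}


for cfg, factor in speedups(results).items():
    print("%-5s %.2fx" % (cfg, factor))
```

On these numbers, 16 single-threaded processes come out at roughly 2.6x the throughput of one 16-thread process.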

From the results so far, it's best to use as many instances as will fit
into memory and evenly distribute CPUs.  For example, a system with 32 CPUs
that could fit 3 copies of moses into memory could use "--threads 2:11:10"
to run 2 instances with 11 threads each and 1 instance with 10 threads.
The script can be used with mert-moses.pl via the --multi-moses flag and
--decoder-flags='--threads P:T:E'.
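
A rough sketch of how one might derive such a value from a CPU count and the number of moses copies that fit into memory. The helper threads_spec is hypothetical and not part of the script. It assumes "P:T" with no E is accepted when the CPUs divide evenly, and because "P:T:E" only allows one odd-sized instance, the split is not always perfectly even.

```python
def threads_spec(cpus, instances):
    """Build a 'P:T' or 'P:T:E' string that uses all CPUs across the instances.

    Hypothetical helper: instances is the number of moses copies that fit
    into memory; it is capped at the number of CPUs.
    """
    if cpus < 1 or instances < 1:
        raise ValueError("cpus and instances must be positive")
    instances = min(instances, cpus)
    base, rem = divmod(cpus, instances)
    if rem == 0:
        return "%d:%d" % (instances, base)
    # Otherwise run (instances - 1) processes with T threads and give the
    # remaining CPUs to one extra process. Try T = ceil and T = floor of
    # cpus / instances and keep the option where T and E are closest.
    candidates = []
    for threads in (base + 1, base):
        extra = cpus - (instances - 1) * threads
        if extra >= 1:
            candidates.append((abs(threads - extra), threads, extra))
    _, threads, extra = min(candidates, key=lambda c: c[0])
    return "%d:%d:%d" % (instances - 1, threads, extra)


print(threads_spec(32, 3))
```

For the 32-CPU, 3-instance case above this prints 2:11:10.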

Best,
Michael


On Tue, Oct 6, 2015 at 4:39 PM, Michael Denkowski <
[email protected]> wrote:

> Hi Hieu and all,
>
> I just checked in a bug fix for the multi_moses.py script.  I forgot to
> override the number of threads for each moses command, so if [threads] were
> specified in the moses.ini, the multi-moses runs were cheating by running a
> bunch of multi-threaded instances.  If threads were only being specified on
> the command line, the script was correctly stripping the flag so everything
> should be good.  I finished a benchmark on my system with an unpruned
> compact PT (with the fixed script) and got the following:
>
> 16 threads 5.38 sent/sec
> 16 procs  13.51 sent/sec
>
> This definitely used a lot more memory though.  Based on some very rough
> estimates looking at free system memory, the memory mapped suffix array PT
> went from 2G to 6G with 16 processes while the compact PT went from 3G to
> 37G.  For cases where everything fits into memory, I've seen significant
> speedup from multi-process decoding.
>
> For cases where things don't fit into memory, the multi-moses script could
> be extended to start as many multi-threaded instances as will fit into RAM
> and farm out sentences in a way that keeps all of the CPUs busy.  I know
> Marcin has mentioned using GNU parallel.
>
> Best,
> Michael
>
> On Tue, Oct 6, 2015 at 4:16 PM, Hieu Hoang <[email protected]> wrote:
>
>> I've just run some comparisons between the multithreaded decoder and the
>> multi_moses.py script. It's good stuff.
>>
>> It makes me seriously wonder whether we should abandon multi-threading
>> and go all out for the multi-process approach.
>>
>> There are some advantages to multi-threading, e.g. where model files are
>> loaded into memory rather than memory-mapped. But there are disadvantages
>> too: it is more difficult to maintain and there's about a 10% overhead.
>>
>> What do people think?
>>
>> Phrase-based:
>>
>> 1 5 10 15 20 25 30 32
>>
>> Baseline (Compact pt)
>> real  4m37.000s  1m15.391s  0m51.217s  0m48.287s  0m50.719s  0m52.027s  0m53.045s
>> user  4m21.544s  5m28.597s  6m38.227s  8m0.975s  8m21.122s  8m3.195s  8m4.663s
>> sys   0m15.451s  0m34.669s  0m53.867s  1m10.515s  1m20.746s  1m24.368s  1m23.677s
>>
>> 34  (32) + multi_moses
>> real  4m49.474s  1m17.867s  0m43.096s  0m31.999s  0m26.497s  0m26.296s  killed
>> user  4m33.580s  4m40.486s  4m56.749s  5m6.692s  5m43.845s  7m34.617s
>> sys   0m15.957s  0m32.347s  0m51.016s  1m11.106s  1m44.115s  2m21.263s
>>
>> 38  Baseline (Probing pt)
>> real  4m46.254s  1m16.637s  0m49.711s  0m48.389s  0m49.144s  0m51.676s  0m52.472s
>> user  4m30.596s  5m32.500s  6m23.706s  7m40.791s  7m51.946s  7m52.892s  7m53.569s
>> sys   0m15.624s  0m36.169s  0m49.433s  1m6.812s  1m9.614s  1m13.108s  1m12.644s
>>
>> 39  (38) + multi moses
>> real  4m43.882s  1m17.849s  0m34.245s  0m31.318s  0m28.054s  0m24.120s  0m22.520s
>> user  4m29.212s  4m47.693s  5m5.750s  5m33.573s  6m18.847s  7m19.642s  8m38.013s
>> sys   0m15.835s  0m25.398s  0m36.716s  0m41.349s  0m48.494s  1m0.843s  1m13.215s
>>
>> Hiero:
>>
>> 3  6/10 baseline
>> real  5m33.011s  1m28.935s  0m59.470s  1m0.315s  0m55.619s  0m57.347s  0m59.191s  1m2.786s
>> user  4m53.187s  6m23.521s  8m17.170s  12m48.303s  14m45.954s  17m58.109s  20m22.891s  21m13.605s
>> sys   0m39.696s  0m51.519s  1m3.788s  1m22.125s  1m58.718s  2m51.249s  4m4.807s  4m37.691s
>>
>> 4  (3) + multi_moses
>> real  1m27.215s  0m40.495s  0m36.206s  0m28.623s  0m26.631s  0m25.817s  0m25.401s
>> user  5m4.819s  5m42.070s  5m35.132s  6m46.001s  7m38.151s  9m6.500s  10m32.739s
>> sys   0m38.039s  0m45.753s  0m44.117s  0m52.285s  0m56.655s  1m6.749s  1m16.935s
>>
>> On 05/10/2015 16:05, Michael Denkowski wrote:
>>
>> Hi Philipp,
>>
>> Unfortunately I don't have a precise measurement.  If anyone knows of a
>> good way to benchmark a process tree with lots of memory mapping the same
>> files, I would be glad to run it.
>>
>> --Michael
>>
>> On Mon, Oct 5, 2015 at 10:26 AM, Philipp Koehn <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> great - that will be very useful.
>>>
>>> Since you just ran the comparison - do you have any numbers on "still
>>> allowed everything to fit into memory", i.e., how much more memory is used
>>> by running parallel instances?
>>>
>>> -phi
>>>
>>> On Mon, Oct 5, 2015 at 10:15 AM, Michael Denkowski <
>>> [email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> Like some other Moses users, I noticed diminishing returns from running
>>>> Moses with several threads.  To work around this, I added a script to run
>>>> multiple single-threaded instances of moses instead of one multi-threaded
>>>> instance.  In practice, this sped things up by about 2.5x for 16 cpus and
>>>> using memory mapped models still allowed everything to fit into memory.
>>>>
>>>> If anyone else is interested in using this, you can prefix a moses
>>>> command with scripts/generic/multi_moses.py.  To use multiple instances in
>>>> mert-moses.pl, specify --multi-moses and control the number of
>>>> parallel instances with --decoder-flags='-threads N'.
>>>>
>>>> Below is a benchmark on WMT fr-en data (2M training sentences, 400M
>>>> words mono, suffix array PT, compact reordering, 5-gram KenLM) testing
>>>> default stack decoding vs cube pruning without and with the parallelization
>>>> script (+multi):
>>>>
>>>> ---
>>>> 1cpu   sent/sec
>>>> stack      1.04
>>>> cube       2.10
>>>> ---
>>>> 16cpu  sent/sec
>>>> stack      7.63
>>>> +multi    12.20
>>>> cube       7.63
>>>> +multi    18.18
>>>> ---
>>>>
>>>> --Michael
>>>>
>>>> _______________________________________________
>>>> Moses-support mailing list
>>>> [email protected]
>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>>
>>>>
>>>
>>
>>
>> --
>> Hieu Hoang
>> http://www.hoang.co.uk/hieu
>>
>>
>
