Hi all,

I extended the multi_moses.py script to support multi-threaded moses instances for cases where memory limits the number of decoders that can run in parallel. The threads arg now takes the form "--threads P:T:E" to run P processes using T threads each and an optional extra process running E threads. The script sends input lines to instances as they have free threads so all CPUs stay busy for the full decoding run.
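To make the argument format concrete, here is a minimal sketch of how a P:T:E value could be expanded into one thread count per instance. This is only an illustration, not the code in multi_moses.py: the function name parse_threads_spec is made up, and treating a bare N as N single-threaded processes (and P:T as having no extra process) is an assumption based on the earlier "-threads N" behaviour.

```python
def parse_threads_spec(spec):
    """Expand a 'P:T:E' string into a list of per-instance thread counts.

    Hypothetical helper for illustration; not taken from multi_moses.py.
    Assumed forms: 'N' (N single-threaded processes), 'P:T', 'P:T:E'.
    """
    fields = [int(f) for f in spec.split(":")]
    if len(fields) == 1:
        procs, threads, extra = fields[0], 1, 0
    elif len(fields) == 2:
        procs, threads = fields
        extra = 0
    elif len(fields) == 3:
        procs, threads, extra = fields
    else:
        raise ValueError("expected P, P:T or P:T:E, got %r" % spec)
    if procs < 1 or threads < 1 or extra < 0:
        raise ValueError("invalid counts in %r" % spec)
    # P instances with T threads each ...
    counts = [threads] * procs
    # ... plus one optional extra instance with E threads.
    if extra > 0:
        counts.append(extra)
    return counts


if __name__ == "__main__":
    counts = parse_threads_spec("2:11:10")
    print(counts, "instances:", len(counts), "total threads:", sum(counts))
```

The sum of the returned list is the number of CPUs the run would occupy, which is a quick way to check a spec against the machine before starting a long decoding job.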
I ran some more benchmarks with the CompactPT system, trading off between threads and processes:

procs x threads   sent/sec
1x16               5.46
2x8                7.58
4x4                9.71
8x2               12.50
16x1              14.08

From the results so far, it's best to use as many instances as will fit into memory and evenly distribute CPUs. For example, a system with 32 CPUs that could fit 3 copies of moses into memory could use "--threads 2:11:10" to run 2 instances with 11 threads each and 1 instance with 10 threads.

The script can be used with mert-moses.pl via the --multi-moses flag and --decoder-flags='--threads P:T:E'.

Best,
Michael

On Tue, Oct 6, 2015 at 4:39 PM, Michael Denkowski <[email protected]> wrote:

> Hi Hieu and all,
>
> I just checked in a bug fix for the multi_moses.py script. I forgot to
> override the number of threads for each moses command, so if [threads] were
> specified in the moses.ini, the multi-moses runs were cheating by running a
> bunch of multi-threaded instances. If threads were only being specified on
> the command line, the script was correctly stripping the flag so everything
> should be good. I finished a benchmark on my system with an unpruned
> compact PT (with the fixed script) and got the following:
>
> 16 threads   5.38 sent/sec
> 16 procs    13.51 sent/sec
>
> This definitely used a lot more memory though. Based on some very rough
> estimates looking at free system memory, the memory mapped suffix array PT
> went from 2G to 6G with 16 processes while the compact PT went from 3G to
> 37G. For cases where everything fits into memory, I've seen significant
> speedup from multi-process decoding.
>
> For cases where things don't fit into memory, the multi-moses script could
> be extended to start as many multi-threaded instances as will fit into ram
> and farm out sentences in a way that keeps all of the CPUs busy. I know
> Marcin has mentioned using GNU parallel.
>
> Best,
> Michael
>
> On Tue, Oct 6, 2015 at 4:16 PM, Hieu Hoang <[email protected]> wrote:
>
>> I've just run some comparison between multithreaded decoder and the
>> multi_moses.py script. It's good stuff.
>>
>> It makes me seriously wonder whether we should abandon multi-threading
>> and go all out for the multi-process approach.
>>
>> There's some advantage to multi-thread - eg. where model files are loaded
>> into memory rather than memory map. But there's disadvantages too - it's more
>> difficult to maintain and there's about a 10% overhead.
>>
>> What do people think?
>>
>> Phrase-based:
>>
>> threads  1           5           10          15          20          25          30
>>
>> 32: Baseline (Compact pt)
>> real     4m37.000s   1m15.391s   0m51.217s   0m48.287s   0m50.719s   0m52.027s   0m53.045s
>> user     4m21.544s   5m28.597s   6m38.227s   8m0.975s    8m21.122s   8m3.195s    8m4.663s
>> sys      0m15.451s   0m34.669s   0m53.867s   1m10.515s   1m20.746s   1m24.368s   1m23.677s
>>
>> 34: (32) + multi_moses
>> real     4m49.474s   1m17.867s   0m43.096s   0m31.999s   0m26.497s   0m26.296s   killed
>> user     4m33.580s   4m40.486s   4m56.749s   5m6.692s    5m43.845s   7m34.617s
>> sys      0m15.957s   0m32.347s   0m51.016s   1m11.106s   1m44.115s   2m21.263s
>>
>> 38: Baseline (Probing pt)
>> real     4m46.254s   1m16.637s   0m49.711s   0m48.389s   0m49.144s   0m51.676s   0m52.472s
>> user     4m30.596s   5m32.500s   6m23.706s   7m40.791s   7m51.946s   7m52.892s   7m53.569s
>> sys      0m15.624s   0m36.169s   0m49.433s   1m6.812s    1m9.614s    1m13.108s   1m12.644s
>>
>> 39: (38) + multi moses
>> real     4m43.882s   1m17.849s   0m34.245s   0m31.318s   0m28.054s   0m24.120s   0m22.520s
>> user     4m29.212s   4m47.693s   5m5.750s    5m33.573s   6m18.847s   7m19.642s   8m38.013s
>> sys      0m15.835s   0m25.398s   0m36.716s   0m41.349s   0m48.494s   1m0.843s    1m13.215s
>>
>> Hiero:
>>
>> 3: 6/10 baseline
>> real     5m33.011s   1m28.935s   0m59.470s   1m0.315s    0m55.619s   0m57.347s   0m59.191s   1m2.786s
>> user     4m53.187s   6m23.521s   8m17.170s   12m48.303s  14m45.954s  17m58.109s  20m22.891s  21m13.605s
>> sys      0m39.696s   0m51.519s   1m3.788s    1m22.125s   1m58.718s   2m51.249s   4m4.807s    4m37.691s
>>
>> 4: (3) + multi_moses
>> real                 1m27.215s   0m40.495s   0m36.206s   0m28.623s   0m26.631s   0m25.817s   0m25.401s
>> user                 5m4.819s    5m42.070s   5m35.132s   6m46.001s   7m38.151s   9m6.500s    10m32.739s
>> sys                  0m38.039s   0m45.753s   0m44.117s   0m52.285s   0m56.655s   1m6.749s    1m16.935s
>>
>> On 05/10/2015 16:05, Michael Denkowski wrote:
>>
>> Hi Philipp,
>>
>> Unfortunately I don't have a precise measurement. If anyone knows of a
>> good way to benchmark a process tree with lots of memory mapping the same
>> files, I would be glad to run it.
>>
>> --Michael
>>
>> On Mon, Oct 5, 2015 at 10:26 AM, Philipp Koehn <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> great - that will be very useful.
>>>
>>> Since you just ran the comparison - do you have any numbers on "still
>>> allowed everything to fit into memory", i.e., how much more memory is used
>>> by running parallel instances?
>>>
>>> -phi
>>>
>>> On Mon, Oct 5, 2015 at 10:15 AM, Michael Denkowski <[email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> Like some other Moses users, I noticed diminishing returns from running
>>>> Moses with several threads. To work around this, I added a script to run
>>>> multiple single-threaded instances of moses instead of one multi-threaded
>>>> instance. In practice, this sped things up by about 2.5x for 16 cpus and
>>>> using memory mapped models still allowed everything to fit into memory.
>>>>
>>>> If anyone else is interested in using this, you can prefix a moses
>>>> command with scripts/generic/multi_moses.py. To use multiple instances in
>>>> mert-moses.pl, specify --multi-moses and control the number of
>>>> parallel instances with --decoder-flags='-threads N'.
>>>>
>>>> Below is a benchmark on WMT fr-en data (2M training sentences, 400M
>>>> words mono, suffix array PT, compact reordering, 5-gram KenLM) testing
>>>> default stack decoding vs cube pruning without and with the parallelization
>>>> script (+multi):
>>>>
>>>> ---
>>>> 1cpu    sent/sec
>>>> stack    1.04
>>>> cube     2.10
>>>> ---
>>>> 16cpu   sent/sec
>>>> stack    7.63
>>>> +multi  12.20
>>>> cube     7.63
>>>> +multi  18.18
>>>> ---
>>>>
>>>> --Michael
>>>>
>>>> _______________________________________________
>>>> Moses-support mailing list
>>>> [email protected]
>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>>
>>
>> --
>> Hieu Hoang
>> http://www.hoang.co.uk/hieu
>>
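P.S. As a footnote to the even-split suggestion at the top of this message (32 CPUs, 3 instances, "--threads 2:11:10"), here is a rough sketch of how such a value could be worked out from a CPU count and the number of instances that fit into memory. The helper name even_threads_spec and the tie-breaking rule are my own assumptions for illustration; the script itself does not compute this for you as far as this thread describes.

```python
def even_threads_spec(cpus, instances):
    """Suggest a 'P:T' or 'P:T:E' string that uses all CPUs as evenly as the
    format allows. Hypothetical helper, not part of multi_moses.py."""
    if instances < 1 or cpus < instances:
        raise ValueError("need at least one CPU per instance")
    base, rem = divmod(cpus, instances)
    if rem == 0:
        # CPUs divide evenly: no extra process needed.
        return "{}:{}".format(instances, base)
    # Otherwise run instances-1 equal processes plus one extra process that
    # takes whatever CPUs remain. Try T = base and T = base + 1 and keep the
    # choice whose extra process is closest in size to the others.
    procs = instances - 1
    candidates = []
    for threads in (base, base + 1):
        extra = cpus - procs * threads
        if extra >= 1:
            candidates.append((abs(threads - extra), threads, extra))
    _, threads, extra = min(candidates)
    return "{}:{}:{}".format(procs, threads, extra)


if __name__ == "__main__":
    print(even_threads_spec(32, 3))
    print(even_threads_spec(16, 4))
```

Because the format only allows one odd-sized instance, some combinations cannot be split perfectly evenly; the sketch then settles for the closest split that still uses every CPU.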
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
