Dear Hieu,

Thank you very much for your e-mail and all the effort you put into re-running the experiments!
I think, however, that an Excel sheet of values gives a rather obscured view of the results. I would rather prefer to see plots of pure decoding times, words per second (wps), or speed-up per number of cores (a sketch of how such plots could be derived follows the quoted thread below). Anyhow, I see that my main concern about the Moses vs Moses2 scalability is actually confirmed: the speed-up of Moses2 is not so much related to better multi-threading as to a single-thread decoding speed improvement, and the original results listed on the website are flawed.

Regarding the reasons you gave me in the previous e-mail: not that I find it important now that I see that you have the same results, but I already pointed out that the cold/hot data issue is not possible, so we can rule it out. The "parameter mismatch" also sounded strange to me, as all the parameters I used are listed in the experimental set-up section: https://github.com/ivan-zapreev/Basic-Translation-Infrastructure#test-set-up-1 So there is nothing else that could be different except for the models and the texts themselves.

Kind regards,

Dr. Ivan S. Zapreev

On Sun, Apr 9, 2017 at 8:42 PM, Hieu Hoang <[email protected]> wrote:

> Hi Ivan,
>
> I've finished running my experiments with the vanilla phrase-based algorithm and the memory phrase-table. The results are here:
> https://docs.google.com/spreadsheets/d/15S1m-6MXNmxc47UiS-AyHyHCLWtGEdxaAvU_0plsPjo/edit?usp=sharing
>
> Summary:
> 1. Moses2 is about 3 times faster than Moses.
> 2. Both decoders are 15-16 times faster running with 32 threads than on 1 thread (on a 16-core/32-hyperthread server).
> 3. Moses2 with the binary phrase-table is slightly faster than with the phrase-table loaded into memory.
>
> I'm happy with the speed of Moses2, and with its scalability wrt the number of cores. The scalability is in line with that reported on the website and in the paper.
>
> The original Moses decoder also seems to have similar scalability, contrary to my previous results. I have some explanation for it, but I'm not too concerned; it's great that Moses is also good!
>
> This doesn't correlate with some of your findings; I've outlined some possible reasons in the last email.
>
> * Looking for MT/NLP opportunities *
> Hieu Hoang
> http://moses-smt.org/
>
> On 4 April 2017 at 11:01, Ivan Zapreev <[email protected]> wrote:
>
>> Dear Hieu,
>>
>> Thank you for the feedback and the info. I am not sure what you mean by "good scalability"; I cannot really visualize the plots from the numbers in my head. Sorry.
>>
>> Using bigger models is indeed always good, but I used the biggest ones that were available.
>>
>> I did make sure there was no swapping; I already mentioned it.
>>
>> I did take the average run times for loading plus decoding and for loading alone, with standard deviations. The latter show that the way things were measured was reliable.
>>
>> The L1 and L2 cache issues do not sound convincing to me. The caches are just up to some MBs, and the models you work with are gigabytes, so there will always be cache misses in this setting. The only issue I can think of is that if the data is not fully pre-loaded into RAM, then you have a cold run, but not more than that.
>>
>> I think if you finish the runs and then plot the results, we could see a clearer picture...
>>
>> Thanks again!
>>
>> Kind regards,
>>
>> Ivan
>>
>> On Tue, Apr 4, 2017 at 11:40 AM, Hieu Hoang <[email protected]> wrote:
>>
>>> Thanks Ivan,
>>>
>>> I'm running the experiments with my models, using the text-based phrase-table that you used.
>>> Experiments are still running; they may take a week to finish.
>>>
>>> However, preliminary results suggest good scalability with Moses2:
>>> https://docs.google.com/spreadsheets/d/15S1m-6MXNmxc47UiS-AyHyHCLWtGEdxaAvU_0plsPjo/edit?usp=sharing
>>>
>>> My models are here for you to download and test yourself if you like:
>>> http://statmt.org/~s0565741/download/for-ivan/
>>>
>>> Below are my thoughts on possible reasons why there are discrepancies in what we're seeing:
>>>
>>> 1. You may have parameters in your moses.ini which are vastly different from mine and suboptimal for speed and scalability. We won't know until we compare our two setups.
>>> 2. Your phrase-table and mine are vastly different sizes. Your phrase-table is 1.3GB; mine is 40GB (unzipped). I also tested a 15GB phrase-table in our AMTA paper and got good scalability, but I have not tried one as small as yours. There may be phenomena that cause Moses2 to perform badly with small models.
>>> 3. You loaded all models into memory; I loaded the phrase-table into memory but had to use binary LM and reordering models. My models are too large to load into RAM (they take up more RAM than the file sizes suggest).
>>> 4. You may also be running out of RAM by loading everything into memory, causing disk swapping.
>>> 5. Your test set (1788 sentences) is too small. My test set is 800,000 sentences (5,847,726 tokens). The decoders rely on CPU caches (L1, L2, etc.) for speed. There are also setup costs for each decoding thread (e.g. creating memory pools in Moses2). If your experiments are over too quickly, you may be measuring the decoder in the 'warm-up lap' rather than when it's running at terminal velocity. Your quickest decoding experiments took 25 sec; my quickest took 200 sec.
>>> 6. I think the way you exclude load time is unreliable. You exclude load time by subtracting the average load time from the total time. However, load time is several times longer than decoding time, so any variation in load time will swamp the decoding time. I use the decoder's debugging timing output instead.
>>>
>>> If you can share your models, we might be able to find out the reason for the difference in our results. I can provide you with ssh/scp access to my server if you need it.
>>>
>>> * Looking for MT/NLP opportunities *
>>> Hieu Hoang
>>> http://moses-smt.org/
>>>
>>> On 4 April 2017 at 09:00, Ivan Zapreev <[email protected]> wrote:
>>>
>>>> Dear Hieu,
>>>>
>>>> Please see the answers below.
>>>>
>>>>> Can you clarify a few things for me.
>>>>> 1. How many sentences and words were in the test set you used to measure decoding speed? Are there many duplicate sentences, i.e. did you create a large test set by concatenating the same small test set multiple times?
>>>>
>>>> We ran the experiments on the same MT04 Chinese text on which we tuned the system. The text consists of 1788 unique sentences and 49582 tokens.
>>>>
>>>>> 2. Are the model sizes you quoted those of the gzipped text files or the unzipped ones, or the model sizes as loaded into memory?
>>>>
>>>> These are the plain-text models as stored on the hard drive.
>>>>
>>>>> 3. Can you please reformat this graph
>>>>> https://github.com/ivan-zapreev/Basic-Translation-Infrastructure/blob/master/doc/images/experiments/servers/stats.time.tools.log.png
>>>>> as #threads vs. words per second, i.e. don't use a log scale, don't use decoding time.
>>>>
>>>> The plot is attached, but this one is not about words per second; it shows the decoding run-times (as in the link you sent). The non-log-scale plot, as you will see, is hard to read. I also attach the plain data files for moses and moses2, with the column values as follows:
>>>>
>>>> number of threads | average runtime (decoding + model loading) | standard deviation | average runtime (model loading only) | standard deviation
>>>>
>>>> --
>>>> Best regards,
>>>>
>>>> Ivan
>>>> <http://www.tainichok.ru/>
>>
>> --
>> Best regards,
>>
>> Ivan
>> <http://www.tainichok.ru/>

--
Best regards,

Ivan
<http://www.tainichok.ru/>
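Ivan asks above for plots of pure decoding time, words per second, and speed-up per number of cores, and Hieu's point 6 warns that subtracting the average load time from the total time lets load-time variance swamp the decoding estimate. Below is a minimal sketch of both computations, based on the five-column data-file layout Ivan describes; the file names ('moses.dat', 'moses2.dat'), the whitespace-separated format, and the use of matplotlib are assumptions for illustration, not part of the original thread.

#!/usr/bin/env python
# A sketch, not part of the thread: derive pure decoding time, words per
# second, and speed-up from measurement files with the columns
#   threads  avg(load+decode)  stddev  avg(load)  stddev
# Assumed: whitespace-separated columns; hypothetical file names;
# the token count is the MT04 test-set size quoted in the thread.
import math
import matplotlib.pyplot as plt

TOKENS = 49582  # MT04 test set size quoted by Ivan

def load_runs(path):
    rows = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 5:
                continue  # skip blank or malformed lines
            threads, total, sd_total, load, sd_load = map(float, parts)
            decode = total - load  # pure decoding time
            # Variances of the two noisy averages add (assuming they are
            # independent); this is exactly why the subtraction is
            # unreliable when load time dominates, as Hieu notes in point 6.
            sd_decode = math.sqrt(sd_total ** 2 + sd_load ** 2)
            rows.append((int(threads), decode, sd_decode))
    return sorted(rows)

def plot(runs):
    fig, (ax_wps, ax_spd) = plt.subplots(1, 2, figsize=(10, 4))
    for path, label in runs:
        rows = load_runs(path)
        threads = [t for t, _, _ in rows]
        decode = [d for _, d, _ in rows]
        # Words per second, with first-order error propagation:
        # wps = T/d  =>  err(wps) ~= T * sd / d^2
        ax_wps.errorbar(threads, [TOKENS / d for d in decode],
                        yerr=[TOKENS * sd / d ** 2 for _, d, sd in rows],
                        marker='o', label=label)
        # Speed-up relative to the single-thread run (first sorted row).
        ax_spd.plot(threads, [decode[0] / d for d in decode],
                    marker='o', label=label)
    # Ideal linear-scaling reference (reuses the last series' thread counts).
    ax_spd.plot(threads, threads, ':', label='ideal')
    ax_wps.set_xlabel('#threads'); ax_wps.set_ylabel('words per second')
    ax_spd.set_xlabel('#threads'); ax_spd.set_ylabel('speed-up vs 1 thread')
    for ax in (ax_wps, ax_spd):
        ax.legend()
    fig.tight_layout()
    fig.savefig('scalability.png')

if __name__ == '__main__':
    plot([('moses.dat', 'Moses'), ('moses2.dat', 'Moses2')])

Run against the two attached data files, this would produce the words-per-second and speed-up curves both correspondents ask for, with error bars that make the load-time subtraction caveat visible.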
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
