It's true that for this experiment (no cube-pruning, loading the phrase-table into memory) Moses scales as well as Moses2, and that Moses2 is only 3 times faster, rather than 10 times.
However, the results on the website and in the paper are correct when using cube-pruning and a binary phrase-table (compact pt for Moses, probing pt for Moses2). I've never tested your setup until now, and you've never tested mine, so we're talking at cross purposes. I will add my recent results to the website to make it more complete.

* Looking for MT/NLP opportunities *
Hieu Hoang
http://moses-smt.org/

On 10 April 2017 at 10:07, Ivan Zapreev <[email protected]> wrote:

> Dear Hieu,
>
> Thank you very much for your e-mail and all the effort you put into
> re-running the experiments!
>
> I think, however, that an Excel sheet of values gives a rather obscured
> view of the results. I would rather see plots of pure decoding times, or
> wps, or the speed increment per number of cores.
>
> Anyhow, I see that my main concern about the Moses vs Moses2 scalability
> is actually confirmed: the speed-up of Moses2 is not so much related to
> better multi-threading but is more of a single-thread decoding speed
> improvement, and the original results listed on the website are flawed.
>
> Regarding the reasons you gave me in the previous e-mail: not that I find
> them important now that I see that you have the same results, but I
> already pointed out that the cold/hot data issue is not possible, so we
> can rule it out. The "parameter mismatch" reason also sounded strange to
> me, as all the parameters I used are listed in the experimental set-up
> section:
> https://github.com/ivan-zapreev/Basic-Translation-Infrastructure#test-set-up-1
> So there is nothing else that could be different except for the models
> and the texts themselves.
>
> Kind regards,
>
> Dr. Ivan S. Zapreev
>
> On Sun, Apr 9, 2017 at 8:42 PM, Hieu Hoang <[email protected]> wrote:
>
>> Hi Ivan,
>>
>> I've finished running my experiments with the vanilla phrase-based
>> algorithm and the memory phrase-table. The results are here:
>> https://docs.google.com/spreadsheets/d/15S1m-6MXNmxc47UiS-AyHyHCLWtGEdxaAvU_0plsPjo/edit?usp=sharing
>> Summary:
>> 1. Moses2 is about 3 times faster than Moses.
>> 2. Both decoders are 15-16 times faster running with 32 threads than on
>> 1 thread (on a 16-core/32-hyperthread server).
>> 3. Moses2 with the binary phrase-table is slightly faster than loading
>> the pt into memory.
>>
>> I'm happy with the speed of Moses2, and with its scalability wrt the
>> number of cores. The scalability is in line with that reported on the
>> website and in the paper.
>>
>> The original Moses decoder also seems to have similar scalability,
>> contrary to my previous results. I have some explanation for it, but I'm
>> not too concerned; it's great that Moses is also good!
>>
>> This doesn't correlate with some of your findings; I've outlined some
>> possible reasons in the last email.
>>
>> * Looking for MT/NLP opportunities *
>> Hieu Hoang
>> http://moses-smt.org/
>>
>> On 4 April 2017 at 11:01, Ivan Zapreev <[email protected]> wrote:
>>
>>> Dear Hieu,
>>>
>>> Thank you for the feedback and the info. I am not sure what you mean by
>>> "good scalability"; I cannot really visualize the plots from numbers in
>>> my head. Sorry.
>>>
>>> Using bigger models is indeed always good, but I used the biggest that
>>> were available.
>>>
>>> I did make sure there was no swapping; I already mentioned it.
>>>
>>> I did take the average run times for decoding plus loading, and for
>>> loading alone, with standard deviations. The latter show that the way
>>> things were measured was reliable.
>>>
>>> The L1 and L2 cache issues do not sound convincing to me.
>>> The caches are just a few MB and the models you work with are
>>> gigabytes. There will always be cache misses in this setting. The only
>>> issue I can think of is that if the data is not fully pre-loaded into
>>> RAM then you get a cold run, but not more than that.
>>>
>>> I think if you finish the runs and could then plot the results, we
>>> could see a clearer picture...
>>>
>>> Thanks again!
>>>
>>> Kind regards,
>>>
>>> Ivan
>>>
>>> On Tue, Apr 4, 2017 at 11:40 AM, Hieu Hoang <[email protected]> wrote:
>>>
>>>> Thanks Ivan,
>>>>
>>>> I'm running the experiments with my models, using the text-based
>>>> phrase-table that you used. The experiments are still running; they may
>>>> take a week to finish.
>>>>
>>>> However, preliminary results suggest good scalability with Moses2:
>>>> https://docs.google.com/spreadsheets/d/15S1m-6MXNmxc47UiS-AyHyHCLWtGEdxaAvU_0plsPjo/edit?usp=sharing
>>>> My models are here for you to download and test yourself if you like:
>>>> http://statmt.org/~s0565741/download/for-ivan/
>>>>
>>>> Below are my thoughts on possible reasons why there are discrepancies
>>>> in what we're seeing:
>>>> 1. You may have parameters in your moses.ini which are vastly
>>>> different from mine and suboptimal for speed and scalability. We won't
>>>> know until we compare our two setups.
>>>> 2. Your phrase-table and mine are vastly different sizes. Your
>>>> phrase-table is 1.3GB, mine is 40GB (unzipped). I've also tested on a
>>>> 15GB phrase-table in our AMTA paper and also got good scalability, but
>>>> I have not tried one as small as yours. There may be phenomena that
>>>> cause Moses2 to be bad with small models.
>>>> 3. You loaded all models into memory; I loaded the phrase-table into
>>>> memory but had to use binary LM and reordering models. My models are
>>>> too large to load into RAM (they take up more RAM than the file size
>>>> suggests).
>>>> 4. You may also be running out of RAM by loading everything into
>>>> memory, causing disk swapping.
>>>> 5. Your test set (1788 sentences) is too small. My test set is 800,000
>>>> sentences (5,847,726 tokens). The decoders rely on CPU caches (L1, L2,
>>>> etc.) for speed. There are also setup costs for each decoding thread
>>>> (e.g. creating memory pools in Moses2). If your experiments are over
>>>> too quickly, you may be measuring the decoder in the 'warm-up lap'
>>>> rather than when it's running at terminal velocity. Your quickest
>>>> decoding experiments took 25 sec; my quickest took 200 sec.
>>>> 6. I think the way you exclude load time is unreliable. You exclude
>>>> load time by subtracting the average load time from the total time.
>>>> However, load time is several times longer than decoding time, so any
>>>> variation in load time will swamp the decoding time. I use the
>>>> decoder's debugging timing output.
>>>>
>>>> If you can share your models, we might be able to find out the reason
>>>> for the difference in our results. I can provide you with ssh/scp
>>>> access to my server if you need to.
>>>>
>>>> * Looking for MT/NLP opportunities *
>>>> Hieu Hoang
>>>> http://moses-smt.org/
>>>>
>>>> On 4 April 2017 at 09:00, Ivan Zapreev <[email protected]> wrote:
>>>>
>>>>> Dear Hieu,
>>>>>
>>>>> Please see the answers below.
>>>>>
>>>>>> Can you clarify a few things for me.
>>>>>> 1. How many sentences and words were in the test set you used to
>>>>>> measure decoding speed? Are there many duplicate sentences, i.e.
>>>>>> did you create a large test set by concatenating the same small test
>>>>>> set multiple times?
>>>>>
>>>>> We ran the experiments on the same MT04 Chinese text that we tuned
>>>>> the system on. The text consists of 1788 unique sentences and 49582
>>>>> tokens.
>>>>>
>>>>>> 2. Are the model sizes you quoted the gzipped text files or
>>>>>> unzipped, or the model size as it is when loaded into memory?
>>>>>
>>>>> These are the plain-text models as stored on the hard drive.
>>>>>
>>>>>> 3. Can you please reformat this graph
>>>>>> https://github.com/ivan-zapreev/Basic-Translation-Infrastructure/blob/master/doc/images/experiments/servers/stats.time.tools.log.png
>>>>>> as #threads v. words per second, i.e. don't use log, don't use
>>>>>> decoding time.
>>>>>
>>>>> The plot is attached, but this one is not about words per second; it
>>>>> shows the decoding run-times (as in the link you sent). The
>>>>> non-log-scale plot, as you will see, is hard to read. I also attach
>>>>> the plain data files for moses and moses2 with the column values as
>>>>> follows:
>>>>>
>>>>> number of threads | average runtime decoding + model loading |
>>>>> standard deviation | average runtime model loading | standard
>>>>> deviation
>>>>>
>>>>> --
>>>>> Best regards,
>>>>>
>>>>> Ivan
>>>>> <http://www.tainichok.ru/>
>>>
>>> --
>>> Best regards,
>>>
>>> Ivan
>>> <http://www.tainichok.ru/>
>
> --
> Best regards,
>
> Ivan
> <http://www.tainichok.ru/>
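
For reference, the words-per-second view Hieu asks for in question 3 can be derived from the five-column data files Ivan describes above (threads, mean decode+load time and its standard deviation, mean load time and its standard deviation) by subtracting the load time and dividing the 49582-token test set by the remainder. Below is a minimal sketch of that conversion; the file names moses.dat and moses2.dat and the use of matplotlib are assumptions for illustration, not part of the thread.

    import matplotlib.pyplot as plt

    TOKENS = 49582  # size of the MT04 test set quoted in the thread

    def words_per_second(path):
        """Read rows of 'threads  mean(decode+load)  sd  mean(load)  sd' and
        return (threads, words per second) with the load time subtracted."""
        threads, wps = [], []
        with open(path) as f:
            for line in f:
                if not line.strip() or line.startswith("#"):
                    continue
                n, total, _sd_total, load, _sd_load = map(float, line.split())
                decode = total - load  # the subtraction Hieu questions in point 6
                threads.append(int(n))
                wps.append(TOKENS / decode)
        return threads, wps

    # Hypothetical file names for the two attached data files.
    for name in ("moses", "moses2"):
        t, w = words_per_second(f"{name}.dat")
        plt.plot(t, w, marker="o", label=name)

    plt.xlabel("threads")
    plt.ylabel("words per second")
    plt.legend()
    plt.savefig("threads_vs_wps.png")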
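
Hieu's point 6 can also be made concrete with a back-of-the-envelope check: when decoding time is recovered as (load + decode) minus a separately averaged load time, the run-to-run noise of both measurements adds up, and next to a 25-second decode it can dominate the number being reported. The figures below are purely illustrative assumptions, not measurements from either set of experiments.

    from math import sqrt

    # Illustrative (assumed) numbers: a 25 s decode hidden inside a much longer
    # load phase, with modest run-to-run variation in both timings.
    decode_true = 25.0   # seconds actually spent decoding
    load_sd = 10.0       # assumed stddev of the load-only runs
    total_sd = 12.0      # assumed stddev of the combined load+decode runs

    # Estimating decode time as total minus the *average* load time combines
    # the noise of two independent measurements.
    est_sd = sqrt(total_sd ** 2 + load_sd ** 2)
    print(f"estimated decode time: {decode_true:.0f} s +/- {est_sd:.1f} s "
          f"({100 * est_sd / decode_true:.0f}% relative uncertainty)")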
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
