Hi Qin Gao,

I don't know if you remember our past discussion, but I have now finished my plans to parallelise EM with OpenMP and MPI. I am now writing up and need to be able to compare my work with your own.
In MGiza++ you are using POSIX threads instead of OpenMP, right? So, did you employ some kind of mechanism to ensure that the number of threads is equal to the number of processors on the node? How did you do this? In OpenMP it's all done for you automagically.

James

Quoting Qin Gao <[email protected]>:

> Hi James,
>
> PGIZA++ is not using OpenMPI; it only uses shared storage to transfer model
> files, which could be a bottleneck. MGIZA++ just uses multi-threading. So
> they are not quite complete and can be further improved. The advantage of
> PGIZA++ is that it already decomposes every training step (the E/M steps of
> model 1, HMM, models 3, 4, 5, etc.), so it should be easy for you to
> understand the logic.
>
> The bottleneck is mainly the huge T-table (translation table), which is
> largest during model 1 training and then becomes smaller as training goes
> on; every child has to load the table (typically several gigabytes for
> model 1 on large data). I think OpenMPI etc. is the right way to go, and
> please let me know if you have any questions on PGIZA++.
>
> Best,
> Qin
>
> On Thu, Feb 19, 2009 at 4:54 PM, James Read <[email protected]> wrote:
>
>> Wow!
>>
>> Thanks for that. That was great. I've had a quick read through your paper.
>> I'm guessing the basis of PGiza++ is OpenMPI calls and the basis of
>> MGiza++ is OpenMP calls, right?
>>
>> Your paper was very fascinating. You mentioned I/O bottlenecks quite a
>> lot with reference to PGiza++, which is to be expected. Did you run any
>> experiments to find what those bottlenecks typically are? How many
>> processors did you hit before you started to lose speed-up? Did this
>> number vary for different data sets?
>>
>> Also, you mention breaking the files up into chunks and working on them
>> on different processors. Obviously you're referring to some kind of data
>> decomposition plan. Does your algorithm have any kind of intelligent data
>> decomposition strategy for reducing communications?
>> Or is it just a case of cutting the file up into n bits and assigning
>> each one to a processor?
>>
>> The reason I ask is that our project would now have to come up with some
>> kind of superior data decomposition plan in order to justify proceeding
>> with the project.
>>
>> Thanks
>>
>> James
>>
>> Quoting Qin Gao <[email protected]>:
>>
>>> Hi James,
>>>
>>> GIZA++ is a very typical EM algorithm, and you probably want to
>>> parallelize the E-step, since it takes much longer than the M-step. You
>>> may want to check out the PGIZA++ and MGIZA++ implementations, which
>>> you can download from my homepage:
>>>
>>> http://www.cs.cmu.edu/~qing
>>>
>>> You may also be interested in a paper describing the work:
>>>
>>> www.aclweb.org/anthology-new/W/W08/W08-0509.pdf
>>>
>>> Please let me know if there is anything I can help with.
>>>
>>> Best,
>>> Qin
>>>
>>> On Thu, Feb 19, 2009 at 4:12 PM, James Read <[email protected]>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> As the title suggests, I am involved in a project which may involve
>>>> parallelising the code of Giza++ so that it will run scalably on
>>>> supercomputers on n processors. This would have obvious benefits to
>>>> any researchers making regular use of Giza++ who would like it to
>>>> finish in minutes rather than hours.
>>>>
>>>> The first step of such a project was profiling Giza++ to see where
>>>> the executable spends most of its time on a typical run. Such
>>>> profiling indicated a number of candidate functions, one of which was
>>>> model1::em_loop, found in the model1.cpp file.
>>>> In order to parallelise such a function (using OpenMPI) it is
>>>> necessary first to come up with some kind of data decomposition
>>>> strategy which minimises the latency of interprocessor communication
>>>> but ensures that the parallelisation has no side effects other than
>>>> running faster on a number of processors, up to some optimal number
>>>> of processors where the latency of communication begins to outweigh
>>>> the benefits of throwing more processors at the job.
>>>>
>>>> In order to do this I am trying to gain an understanding of the logic
>>>> in the model1::em_loop function. However, intuitive comments are
>>>> lacking in the code. Does anyone on this list have a good internal
>>>> knowledge of this function? Enough to give a rough outline of the
>>>> logic it contains in some kind of readable pseudocode?
>>>>
>>>> Thanks
>>>>
>>>> P.S. Apologies to anybody to whom this email was not of interest.
>>>>
>>>> --
>>>> The University of Edinburgh is a charitable body, registered in
>>>> Scotland, with registration number SC005336.
>>>>
>>>> _______________________________________________
>>>> Moses-support mailing list
>>>> [email protected]
>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>
>>> --
>>> ==========================================
>>> Qin Gao
>>> Language Technology Institution
>>> Carnegie Mellon University
>>> http://geek.kyloo.net
>>> ------------------------------------------------------------------------------------
>>> Please help improving NLP articles on Wikipedia
>>> ==========================================
