Another architecture to consider is storing/distributing the ttable from a single central repository. Most of the ttable is full of crap, and for each sentence you know exactly what parameters will be required in advance of running your E-step. So, by not distributing stuff you don't need, you'll save a lot of effort, and the amount of memory used by each individual compute node will be radically reduced.
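The idea above could be sketched roughly as follows. This is not code from GIZA++ or PGIZA++, just an illustrative Python sketch with made-up names: before shipping work to a node, collect the (source, target) word pairs its chunk of sentence pairs will actually touch, and send only that slice of the central ttable.

```python
def needed_params(chunk):
    """Collect the (source, target) word pairs whose t(e|f) values a chunk
    of sentence pairs will touch during the E-step: exactly the pairs that
    co-occur in some sentence pair of the chunk."""
    needed = set()
    for src_sent, tgt_sent in chunk:
        for f in src_sent:
            for e in tgt_sent:
                needed.add((f, e))
    return needed

def slice_ttable(ttable, chunk):
    """Extract the sub-table a compute node actually needs for its chunk,
    instead of shipping the full multi-gigabyte T-table."""
    return {pair: ttable[pair]
            for pair in needed_params(chunk) if pair in ttable}

# Toy example: a central ttable and one node's chunk.
ttable = {("la", "the"): 0.7, ("maison", "house"): 0.6,
          ("la", "house"): 0.1, ("chien", "dog"): 0.9}
chunk = [(["la", "maison"], ["the", "house"])]
sub = slice_ttable(ttable, chunk)
# Only parameters co-occurring in the chunk survive;
# ("chien", "dog") is never needed by this node and is dropped.
print(sorted(sub))
```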
Chris

On Fri, Feb 20, 2009 at 1:41 PM, Qin Gao <[email protected]> wrote:
> Hi, James,
>
> PGIZA++ is not using OpenMPI; it only uses shared storage to transfer model
> files, which can be a bottleneck. MGIZA++ just uses multi-threading. So
> neither is complete and both can be further improved. The advantage of
> PGIZA++ is that it already decomposes every training step (E/M for model 1,
> HMM, models 3, 4, 5, etc.), so it should be easy for you to understand the
> logic.
>
> The major bottleneck is the huge T-table (translation table), which is
> largest during model 1 training and then becomes smaller as training goes
> on; every child has to get the whole table (typically several gigabytes for
> model 1 on large data). I think OpenMPI etc. is the right way to go, and
> please let me know if you have any questions on PGIZA++.
>
> Best,
> Qin
>
> On Thu, Feb 19, 2009 at 4:54 PM, James Read <[email protected]> wrote:
>>
>> Wow!
>>
>> Thanks for that. That was great. I've had a quick read through your paper.
>> I'm guessing the basis of PGiza++ is OpenMPI calls and the basis of
>> MGiza++ is OpenMP calls, right?
>>
>> Your paper was very fascinating. You mentioned I/O bottlenecks quite a lot
>> with reference to PGiza++, which is to be expected. Did you run any
>> experiments to find what those bottlenecks typically are? How many
>> processors did you hit before you started to lose speed-up? Did this
>> number vary for different data sets?
>>
>> Also, you mention breaking up the files into chunks and working on them on
>> different processors. Obviously you're referring to some kind of data
>> decomposition plan. Does your algorithm have any kind of intelligent data
>> decomposition strategy for reducing communication? Or is it just a case of
>> cutting the file up into n bits and assigning each one to a processor?
>>
>> The reason I ask is that our project would now have to come up with some
>> kind of superior data decomposition plan in order to justify proceeding
>> with the project.
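One step up from "cutting the file into n bits" is to balance the chunks by estimated E-step cost rather than by sentence count, since Model 1 E-step work per sentence pair grows roughly with |f|·|e|. A hedged Python sketch of such a contiguous decomposition (illustrative only, not PGIZA++'s actual strategy):

```python
def decompose(corpus, n):
    """Split sentence pairs into n contiguous chunks of roughly equal
    E-step cost, approximating the cost of a pair as len(f) * len(e).
    Contiguous chunks keep the file splittable with simple offsets."""
    costs = [len(f) * len(e) for f, e in corpus]
    target = sum(costs) / n
    chunks, current, acc = [], [], 0.0
    for pair, c in zip(corpus, costs):
        current.append(pair)
        acc += c
        if acc >= target and len(chunks) < n - 1:
            chunks.append(current)
            current, acc = [], 0.0
    chunks.append(current)
    return chunks

# Two pairs of equal cost split evenly across two workers.
chunks = decompose([(["a", "b"], ["x", "y"]),
                    (["c", "d"], ["z", "w"])], 2)
print([len(c) for c in chunks])
```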
>>
>> Thanks
>>
>> James
>>
>> Quoting Qin Gao <[email protected]>:
>>
>>> Hi James,
>>>
>>> GIZA++ is a very typical EM algorithm, and you probably want to
>>> parallelize the E-step, since it takes much longer than the M-step. You
>>> may want to check out the PGIZA++ and MGIZA++ implementations, which you
>>> can download from my homepage:
>>>
>>> http://www.cs.cmu.edu/~qing
>>>
>>> And you may also be interested in a paper describing the work:
>>>
>>> www.aclweb.org/anthology-new/W/W08/W08-0509.pdf
>>>
>>> Please let me know if there is anything I can help with.
>>>
>>> Best,
>>> Qin
>>>
>>> On Thu, Feb 19, 2009 at 4:12 PM, James Read <[email protected]>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> As the title suggests, I am involved in a project which may involve
>>>> parallelising the code of Giza++ so that it will run scalably on
>>>> supercomputers on n processors. This would have obvious benefits for
>>>> any researchers making regular use of Giza++ who would like it to
>>>> finish in minutes rather than hours.
>>>>
>>>> The first step of such a project was profiling Giza++ to see where the
>>>> executable spends most of its time on a typical run. Such profiling
>>>> indicated a number of candidate functions, one of which was
>>>> model1::em_loop, found in the model1.cpp file.
>>>>
>>>> In order to parallelise such a function (using OpenMPI) it is
>>>> necessary to first come up with some kind of data decomposition
>>>> strategy which minimises the latency of interprocessor communication
>>>> but ensures that the parallelisation has no side effects other than
>>>> running faster on a number of processors, up to some optimal number
>>>> where the latency of communication begins to outweigh the benefits of
>>>> throwing more processors at the job.
>>>>
>>>> In order to do this I am trying to gain an understanding of the logic
>>>> in the model1::em_loop function. However, intuitive comments are
>>>> lacking in the code.
>>>> Does anyone on this list have a good internal knowledge of this
>>>> function? Enough to give a rough outline of the logic it contains in
>>>> some kind of readable pseudocode?
>>>>
>>>> Thanks
>>>>
>>>> P.S. Apologies to anybody to whom this email was not of interest.
>>>>
>>>> --
>>>> The University of Edinburgh is a charitable body, registered in
>>>> Scotland, with registration number SC005336.
>>>>
>>>> _______________________________________________
>>>> Moses-support mailing list
>>>> [email protected]
>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>
>>> --
>>> ==========================================
>>> Qin Gao
>>> Language Technologies Institute
>>> Carnegie Mellon University
>>> http://geek.kyloo.net
>>> ------------------------------------------------------------------------------------
>>> Please help improve NLP articles on Wikipedia
>>> ==========================================
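For readers of the archive wondering about the logic asked for above: model1::em_loop implements (with GIZA++-specific bookkeeping) the E-step of IBM Model 1 EM training. A minimal single-machine Python sketch of that algorithm follows, not the GIZA++ code itself. The point relevant to parallelisation is that the expected counts from disjoint corpus chunks are simple sums, so they can be computed independently and merged before the M-step.

```python
from collections import defaultdict

def em_model1(corpus, iterations=5):
    """Minimal IBM Model 1 EM over (source_sentence, target_sentence)
    pairs. The E-step inner loops are the part model1::em_loop covers and
    the part worth distributing: `count` and `total` are additive across
    corpus chunks."""
    # Uniform initialisation of t(e|f); only co-occurring pairs get read.
    t = defaultdict(lambda: 1.0)
    for _ in range(iterations):
        count = defaultdict(float)   # expected count c(f, e)
        total = defaultdict(float)   # marginal count c(f)
        # E-step: distribute each target word's probability mass over the
        # source words that could have generated it.
        for f_sent, e_sent in corpus:
            for e in e_sent:
                z = sum(t[(f, e)] for f in f_sent)  # normaliser
                for f in f_sent:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[f] += c
        # M-step: renormalise the expected counts into new t(e|f).
        for (f, e), c in count.items():
            t[(f, e)] = c / total[f]
    return t

# Classic toy corpus: "la" co-occurs with "the" in both pairs, so EM
# should come to prefer t(the|la) over t(the|maison).
corpus = [(["la", "maison"], ["the", "house"]),
          (["la", "fleur"], ["the", "flower"])]
t = em_model1(corpus)
```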
