Another architecture to consider is storing/distributing the ttable
from a single central repository.  Most of the ttable is full of crap,
and for each sentence you know exactly which parameters will be
required in advance of running your E step.  So, by not distributing
the stuff you don't need, you'll save a lot of effort, and the amount
of memory used by each individual compute node will be radically
reduced.
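As a concrete sketch of that idea (illustrative data structures and helper names, not GIZA++'s actual code): compute the set of parameters each sentence pair can touch, and ship only that slice of the ttable to the worker handling it.

```python
def required_params(chunk):
    """The Model 1 E-step for a sentence pair only ever touches t(f|e)
    entries for (e, f) pairs that co-occur in that pair, so the needed
    parameters are known before the E-step runs."""
    needed = set()
    for src, tgt in chunk:
        for e in src + ["NULL"]:  # Model 1 adds an empty-word position
            for f in tgt:
                needed.add((e, f))
    return needed

def slice_ttable(ttable, chunk):
    """Ship only the entries this worker's chunk can touch, instead of
    the full (mostly irrelevant) table."""
    needed = required_params(chunk)
    return {pair: ttable[pair] for pair in needed if pair in ttable}

# A worker assigned one sentence pair receives a handful of parameters,
# not the multi-gigabyte table.
ttable = {("house", "maison"): 0.7, ("cat", "chat"): 0.9}
chunk = [(["the", "house"], ["la", "maison"])]
local = slice_ttable(ttable, chunk)
```

The same filtering trick is what lets each compute node hold only a tiny fraction of the full table in memory.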

Chris

On Fri, Feb 20, 2009 at 1:41 PM, Qin Gao <[email protected]> wrote:
> Hi, James,
>
> PGIZA++ does not use OpenMPI; it only uses shared storage to transfer model
> files, which can be a bottleneck. MGIZA++ just uses multi-threading. So
> they are not complete solutions and can be further improved. The advantage of
> PGIZA++ is that it already decomposes every training step (E/M for Model 1,
> HMM, Models 3, 4, 5, etc.), so it should be easy for you to understand the logic.
>
> The main bottleneck is the huge T-table (translation table), which is
> largest during Model 1 training and then becomes smaller as training goes
> on. Every child has to get the whole table (typically several gigabytes for
> Model 1 on large data). I think OpenMPI is the right way to go, and please
> let me know if you have any questions about PGIZA++.
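A minimal sketch of that setup (illustrative names only, not PGIZA++'s actual code): each child runs the Model 1 E-step over its own chunk and returns partial expected counts, which the master sums and renormalizes in the M-step.

```python
from collections import defaultdict

def e_step_chunk(chunk, t):
    """Model 1 E-step over one chunk: accumulate expected counts for
    this worker's sentence pairs only."""
    counts = defaultdict(float)
    totals = defaultdict(float)
    for src, tgt in chunk:
        src = src + ["NULL"]  # Model 1's empty word
        for f in tgt:
            z = sum(t[(e, f)] for e in src)  # normalizer over alignments
            for e in src:
                p = t[(e, f)] / z            # posterior of aligning f to e
                counts[(e, f)] += p
                totals[e] += p
    return counts, totals

def m_step(partials):
    """Master side: sum the partial counts from all children, then
    renormalize to get the new translation table."""
    counts = defaultdict(float)
    totals = defaultdict(float)
    for c, tot in partials:
        for k, v in c.items():
            counts[k] += v
        for e, v in tot.items():
            totals[e] += v
    return {(e, f): v / totals[e] for (e, f), v in counts.items()}

# Two "children", one sentence pair each, uniform initial t.
t = defaultdict(lambda: 0.25)
chunks = [[(["the", "house"], ["la", "maison"])],
          [(["the", "book"], ["le", "livre"])]]
new_t = m_step([e_step_chunk(c, t) for c in chunks])
```

Only the partial counts cross the network here; the expensive part is that each child still needs its slice of t up front, which is exactly the T-table transfer being discussed.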
>
> Best,
> Qin
>
> On Thu, Feb 19, 2009 at 4:54 PM, James Read <[email protected]> wrote:
>>
>> Wow!
>>
>> Thanks for that. That was great. I've had a quick read through your paper.
>> I'm guessing the basis of PGIZA++ is OpenMPI calls and the basis of MGIZA++
>> is OpenMP calls, right?
>>
>> Your paper was fascinating. You mentioned I/O bottlenecks quite a lot
>> with reference to PGIZA++, which is to be expected. Did you run any
>> experiments to find out what those bottlenecks typically are? How many
>> processors did you reach before you started to lose speed-up? Did this
>> number vary for different data sets?
>>
>> Also, you mention breaking up the files into chunks and working on them on
>> different processors. Obviously you're referring to some kind of data
>> decomposition plan. Does your algorithm have any kind of intelligent data
>> decomposition strategy for reducing communications? Or is it just a case of
>> cutting the file up into n bits and assigning each one to a processor?
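For concreteness, the naive decomposition being asked about here is just a contiguous equal split (`split_corpus` is an illustrative helper, not taken from either tool):

```python
def split_corpus(bitext, n_workers):
    """Cut the corpus into n contiguous, nearly equal chunks, one per
    processor. No attempt is made to minimize communication; a smarter
    decomposition might group sentences with overlapping vocabulary so
    each worker needs a smaller slice of the ttable."""
    k, r = divmod(len(bitext), n_workers)
    chunks, start = [], 0
    for i in range(n_workers):
        end = start + k + (1 if i < r else 0)  # first r chunks get one extra
        chunks.append(bitext[start:end])
        start = end
    return chunks

sizes = [len(c) for c in split_corpus(list(range(10)), 4)]
```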
>>
>> The reason I ask is that our project would now have to come up with some
>> kind of superior data decomposition plan in order to justify proceeding with
>> the project.
>>
>> Thanks
>>
>> James
>>
>> Quoting Qin Gao <[email protected]>:
>>
>>> Hi James,
>>>
>>> GIZA++ is a very typical EM algorithm, and you probably want to
>>> parallelize the E-step, since it takes much longer than the M-step. You
>>> may want to check out the PGIZA++ and MGIZA++ implementations, which you
>>> can download from my homepage:
>>>
>>> http://www.cs.cmu.edu/~qing
>>>
>>> And you may also be interested in a paper describing the work:
>>>
>>> www.aclweb.org/anthology-new/W/W08/W08-0509.pdf
>>>
>>> Please let me know if there is anything I can help with.
>>>
>>> Best,
>>> Qin
>>>
>>> On Thu, Feb 19, 2009 at 4:12 PM, James Read <[email protected]>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> As the title suggests, I am involved in a project which may involve
>>>> parallelising the code of GIZA++ so that it will run scalably on
>>>> supercomputers on n processors. This would have obvious benefits for
>>>> any researchers making regular use of GIZA++ who would like it to
>>>> finish in minutes rather than hours.
>>>>
>>>> The first step of such a project was profiling GIZA++ to see where the
>>>> executable spends most of its time on a typical run. Such profiling
>>>> indicated a number of candidate functions, one of which was
>>>> model1::em_loop, found in model1.cpp.
>>>>
>>>> In order to parallelise such a function (using OpenMPI), it is
>>>> necessary to first come up with a data decomposition strategy that
>>>> minimizes the latency of interprocessor communication while ensuring
>>>> that the parallelisation has no side effects other than running
>>>> faster. This holds up to some optimal number of processors, at which
>>>> point the latency of communication begins to outweigh the benefit of
>>>> throwing more processors at the job.
>>>>
>>>> In order to do this I am trying to gain an understanding of the logic
>>>> in the model1::em_loop function. However, the code lacks intuitive
>>>> comments. Does anyone on this list have good internal knowledge of
>>>> this function? Enough to give a rough outline of the logic it contains
>>>> in some kind of readable pseudocode?
>>>>
>>>> Thanks
>>>>
>>>> P.S. Apologies to anybody to whom this email was not of interest.
>>>>
>>>> --
>>>> The University of Edinburgh is a charitable body, registered in
>>>> Scotland, with registration number SC005336.
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> ==========================================
>>> Qin Gao
>>> Language Technologies Institute
>>> Carnegie Mellon University
>>> http://geek.kyloo.net
>>>
>>> ------------------------------------------------------------------------------------
>>> Please help improving NLP articles on Wikipedia
>>> ==========================================
>>>
>>
>>
>>
>>
>>
>
>
>
>
>
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
