BTW,

just wanted to pass this on to you guys. As you may have gathered from
my recent questions, I've been implementing IBM Model 1 (trained with EM)
in a simple piece of code.

I've got different versions of the code using different data
structures, and I have a few observations which I think you may be
interested in. Obviously, the central and most performance-critical
decision is which data structures you use to store your counts, totals
and probabilities.
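For context, one EM iteration of Model 1 boils down to accumulating expected counts and per-f totals and then renormalising. A minimal sketch (my own illustration, not the code described here — names like `model1_em` and `bitext` are made up):

```python
from collections import defaultdict

def model1_em(bitext, iterations=5):
    """Train IBM Model 1 translation probabilities t(e|f) with EM.
    bitext: list of (f_sent, e_sent) pairs, each a list of tokens."""
    # Uniform start: any constant works, since the E-step normalises.
    t = defaultdict(lambda: 1.0)
    for _ in range(iterations):
        count = defaultdict(float)   # expected counts c(e, f)
        total = defaultdict(float)   # totals per f word
        for f_sent, e_sent in bitext:
            for e in e_sent:
                # Normaliser: sum of t(e|f) over all f in this sentence.
                z = sum(t[(e, f)] for f in f_sent)
                for f in f_sent:
                    delta = t[(e, f)] / z
                    count[(e, f)] += delta
                    total[f] += delta
        # M-step: re-estimate t(e|f) from the expected counts.
        t = defaultdict(float)
        for (e, f), c in count.items():
            t[(e, f)] = c / total[f]
    return t
```

The `count` and `total` tables here are exactly the structures whose representation (array vs. hash) the rest of this mail is about.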

For the counts and probabilities I've experimented with a number of
strategies, including simple 2D arrays, linked lists, hash tables
and hybrid data structures. From a performance point of view, the 2D
array is the fastest to look up and update an e|f entry, as you can
address it directly. From a memory-efficiency point of view, you can fit
much more into memory using a linked list or a hash table. The reason
is simply that the 2D array is extremely sparse: many of
the e|f pairs a complete array would hold simply do not occur in any
corpus, i.e. this is a sparse matrix.

However, using a hash table is much slower than using a 2D array. So
this is the trick I used. I first find all the words in the corpus, count
their frequencies, and then sort them by frequency, giving the
most frequent words the lowest ids. The Zipfian distribution of word
frequencies then makes the sparse matrix have a dense quadrant in the
upper-left corner.
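The id assignment is straightforward; a sketch (names like `freq_sorted_ids` are my own, not from the code described here):

```python
from collections import Counter

def freq_sorted_ids(corpus):
    """Assign word ids so the most frequent word gets id 0.
    corpus: iterable of token lists."""
    freq = Counter(tok for sent in corpus for tok in sent)
    # Descending frequency; break ties alphabetically for determinism.
    ranked = sorted(freq, key=lambda w: (-freq[w], w))
    return {w: i for i, w in enumerate(ranked)}
```

With ids assigned this way on both sides, the high-count (e, f) pairs cluster at low (e_id, f_id) coordinates, which is what makes the upper-left quadrant of the matrix dense.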

I then grow the 2D array in memory to a maximum, making sure that
it does not overflow into swap space. I then filter the corpus so that
only sentences consisting entirely of words within the 2D array bounds
are kept. The results were surprisingly positive. By building such a
dense matrix you can run EM on a large proportion of the sentences with
a much smaller 2D array (I use a 2D array of structs with a count and a
probability for each e|f pair).
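The filtering step amounts to keeping only sentence pairs whose words all have ids below the chosen table dimensions. A sketch, assuming the frequency-sorted id maps from above (`filter_corpus`, `max_e`, `max_f` are my own names):

```python
def filter_corpus(bitext, e_ids, f_ids, max_e, max_f):
    """Keep only sentence pairs whose words all fit inside a
    max_e x max_f table, given frequency-sorted id maps."""
    kept = []
    for f_sent, e_sent in bitext:
        if all(f_ids[w] < max_f for w in f_sent) and \
           all(e_ids[w] < max_e for w in e_sent):
            kept.append((f_sent, e_sent))
    return kept
```

Because the lowest ids belong to the most frequent words, even a modest bound keeps a large share of the sentences, which matches the table-size numbers below.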

The resulting code is *fast*. It can finish a five-iteration Model 1 EM
run on the en-fr Europarl corpus in a matter of minutes.

Using a 1GB table it can process 890477 of 1555073 sentences
Using a 2GB table it can process 1043215 of 1555073 sentences
Using a 3GB table it can process 1132865 of 1555073 sentences
...
Using a 7GB table it can process 1291927 of 1555073 sentences

My hash table version is not as fast but it can fit the entire set of  
e|f pairs in 3.1GB of memory.

A hybrid of the 2D array and hash table has shown that you get the
best of both worlds. You can do fast updates to counts and
probabilities via the 2D table for the most commonly occurring e|f
pairs, and use the hash table for the rest. My 2D table holds pointers
to entries in the hash table, so really everything lives in the hash
table; the 2D array just gives faster lookups to the structs containing
the counts and probabilities.
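The pointer idea translates to any language with shared references; here is a sketch of the scheme in Python (the class and method names are my own, and the real implementation is presumably C/C++):

```python
class Entry:
    """One e|f cell: a count and a probability."""
    __slots__ = ("count", "prob")
    def __init__(self):
        self.count = 0.0
        self.prob = 0.0

class HybridTable:
    """A hash table owning every (e_id, f_id) entry, plus a dense 2D
    cache of references to the entries for low-id (frequent) pairs,
    so hot pairs skip the hash lookup."""
    def __init__(self, n_e, n_f):
        self.table = {}                        # (e_id, f_id) -> Entry
        self.cache = [[None] * n_f for _ in range(n_e)]
        self.n_e, self.n_f = n_e, n_f

    def get(self, e, f):
        if e < self.n_e and f < self.n_f:
            entry = self.cache[e][f]
            if entry is None:
                # First touch: create in the hash table, cache the reference.
                entry = self.table.setdefault((e, f), Entry())
                self.cache[e][f] = entry
            return entry
        # Rare pair: plain hash-table path.
        return self.table.setdefault((e, f), Entry())
```

Since the 2D cache holds references to the same `Entry` objects as the hash table, an update through either path is visible through both, which is exactly the "everything is really in the hash table" property described above.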

Just thought this information might be interesting to some of you as  
GIZA++ seems to be taking the lion's share of the processing time in  
training Moses.

James






-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.



_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
