BTW, just wanted to pass this on to you guys. As you may have gathered from recent questions I've been implementing EM Model 1 in a simple code.
I've got different versions of the code using different data structures, and I have a few observations which I think you may find interesting. Obviously, the central and most performance-critical decision is the data structure you use to store your counts, totals and probabilities.

For the counts and probabilities I've experimented with a number of strategies, including simple 2D arrays, linked lists, hash tables and hybrid data structures. From a performance point of view, the 2D array is the fastest for looking up and updating an e|f entry, as you can address it directly. From a memory-efficiency point of view, you can fit much more into memory using a linked list or a hash table. The reason is simply that the 2D array is extremely sparse: many of the e|f pairs a complete array would hold do not exist in any corpus, i.e. this is a sparse matrix. However, a hash table is much slower than a 2D array.

So this is the trick I used. I first find all the words in the corpus, count their frequencies, and then sort them by frequency, giving the most frequent words the lowest ids. Zipf's law then makes the sparse matrix have a dense quadrant in its upper-left corner. I grow the 2D array in memory to a maximum, making sure it does not overflow into swap space, and then filter the corpus so that only sentences consisting entirely of words within the 2D array's bounds are processed.

The results were surprisingly positive. With such a dense matrix you can run EM on a large proportion of the sentences using a much smaller 2D array (I use a 2D array of structs with a count and a probability for each e|f pair). The resulting code is *fast*. It can finish a 5-iteration EM Model 1 on the en-fr Europarl corpus in a matter of minutes.

Using a 1GB table it can process 890477 of 1555073 sentences
Using a 2GB table it can process 1043215 of 1555073 sentences
Using a 3GB table it can process 1132865 of 1555073 sentences
...
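To make the trick concrete, here is a minimal Python sketch of the idea (the names, toy corpus and table size are illustrative assumptions, not the actual implementation): count word frequencies, give the lowest ids to the most frequent words, and size a dense 2D table of (count, probability) cells so that direct addressing works for the common pairs.

```python
from collections import Counter

def frequency_sorted_ids(tokens):
    """Assign the lowest ids to the most frequent words, so a
    fixed-size 2D table covers the dense upper-left quadrant."""
    freq = Counter(tokens)
    ranked = [w for w, _ in freq.most_common()]
    return {w: i for i, w in enumerate(ranked)}

# Toy corpus: "the" is most frequent, so it gets id 0.
tokens = ["the", "cat", "the", "dog", "the", "cat"]
ids = frequency_sorted_ids(tokens)

# Dense table sized to whatever fits in RAM; each cell holds a
# [count, probability] pair, mirroring the struct-per-e|f-pair layout.
MAX_ID = 2  # toy bound: only the two most frequent words fit
table = [[[0.0, 0.0] for _ in range(MAX_ID)] for _ in range(MAX_ID)]

def in_bounds(e, f):
    """A sentence pair can use the fast path only if every word's
    id falls inside the table bounds."""
    return ids[e] < MAX_ID and ids[f] < MAX_ID

if in_bounds("the", "cat"):
    table[ids["the"]][ids["cat"]][0] += 1.0  # direct-address count update
```

In a real implementation the bound check would be applied per sentence when filtering the corpus, and pairs falling outside the bounds would either be skipped or (in the hybrid version) sent to a hash table.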
Using a 7GB table it can process 1291927 of 1555073 sentences

My hash table version is not as fast, but it can fit the entire set of e|f pairs in 3.1GB of memory. A hybrid of the 2D array and the hash table gives the best of both worlds: you get fast updates to counts and probabilities via the 2D table for the most commonly occurring e|f pairs, and use the hash table for the rest. My 2D table holds pointers to entries in the hash table, so really everything lives in the hash table; the 2D array just gives faster lookups to the structs containing the counts and probabilities.

Just thought this information might be interesting to some of you, as GIZA++ seems to be taking the lion's share of the processing time in training Moses.

James

--
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
