> 1. I agree that I might not have to use any fancy smoothing, but even at
> Google scale using simple smoothing seems to aid performance (at least for
> machine translation): http://acl.ldc.upenn.edu/D/D07/D07-1090.pdf

I said "fancy smoothing", not no smoothing. We actually do on-the-fly
Witten-Bell smoothing (and sometimes Stupid Backoff, which is what you can do
with large LMs).
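As a rough sketch (not our actual implementation), Stupid Backoff is just relative frequencies with a fixed backoff penalty; the 0.4 factor is the value from the Brants et al. (2007) paper linked above, and the data structures here are invented for illustration:

```python
# Toy Stupid Backoff scorer (Brants et al., 2007). The 0.4 factor comes
# from the paper; everything else here is illustrative only.
ALPHA = 0.4

def stupid_backoff(words, counts, alpha=ALPHA):
    """Score a word sequence: relative frequency if the full ngram was
    seen, otherwise back off to the shorter history with a fixed penalty.
    Returns a score, not a normalised probability."""
    if len(words) == 1:
        unigram_total = sum(c for ng, c in counts.items() if len(ng) == 1)
        return counts.get(tuple(words), 0) / unigram_total
    ngram, context = tuple(words), tuple(words[:-1])
    if counts.get(ngram, 0) > 0 and counts.get(context, 0) > 0:
        return counts[ngram] / counts[context]
    return alpha * stupid_backoff(words[1:], counts, alpha)
```

Note the scores are deliberately not probabilities (they don't sum to one); that is exactly why it is cheap at scale, since no discounting statistics need to be computed.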
For smaller LMs we do Kneser-Ney.

> 2. Is your code open source?

My ngram code hasn't been released, but it is not hard to do yourself:
collecting ngrams and counts is really a generalisation of the standard word
counting problem (to make it more efficient you would need in-mapper
combining). One thing I have been meaning to do is deal with skewed sharding:
high-frequency function words tend to get sent to the same shard, which
leaves the reduce phase poorly balanced. To fix this you key on the first two
words of an ngram rather than just one.

> 3. I was also looking to understand if there were any efforts to store
> these large sets optimally for real-time access. Can you please point me to
> the effort on hosting LMs using Hypertable?

Currently we store very large LMs in a randomised manner. Look here for our
SourceForge release: https://sourceforge.net/projects/randlm/

The associated papers can be found on my homepage, under randomised language
modelling: http://www.iccs.informatics.ed.ac.uk/~miles/mt-papers.html

The state of the art in large LMs is to use a cluster of machines (i.e. some
kind of BigTable setup) along with a randomised representation. If you store
fingerprints for ngrams and quantise your probabilities, you can retrieve
each gram in about three hash functions (or fewer).

Over time I have been exploring how to do this. My first attempt used Chord,
but that didn't really work out. We also looked at HBase (ditto). Right now I
have a student looking at Hypertable. He has implemented non-blocking I/O
(i.e. you can batch requests, send them off and do something else) and also
some tricks to spot when bogus ngram requests are being made across the
network. It turns out that for machine translation, the vast majority of
ngram requests are for grams that don't exist.

Miles

--
The University of Edinburgh is a charitable body, registered in Scotland,
with registration number SC005336.
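To make the word-counting analogy from point 2 concrete, here is a toy Python sketch of a mapper with in-mapper combining, plus a shard function keyed on the first two words of each ngram; the function names are invented for illustration and a real job would use the MapReduce framework's own interfaces:

```python
# Illustrative ngram counting with in-mapper combining, plus two-word
# sharding to avoid skew from high-frequency function words.
import hashlib
from collections import Counter

def map_ngrams(sentences, n=3):
    """Aggregate ngram counts locally (in-mapper combining) and emit each
    distinct ngram once, instead of one record per occurrence."""
    local = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i in range(len(tokens) - n + 1):
            local[tuple(tokens[i:i + n])] += 1
    yield from local.items()

def shard_for(ngram, num_shards):
    """Route an ngram to a reducer shard by hashing its first TWO words,
    so that a single frequent word ('the', 'of', ...) does not funnel all
    of its ngrams onto one reducer. Uses a deterministic hash, since
    Python's built-in hash() varies across runs."""
    key = " ".join(ngram[:2]).encode("utf-8")
    return int(hashlib.md5(key).hexdigest(), 16) % num_shards
```

With single-word keying, every trigram starting with "the" would land on one shard; keying on the first two words spreads them across shards while still keeping ngrams with the same two-word history together.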
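To give a flavour of the fingerprint-plus-quantisation idea, here is a toy Python sketch (not randlm's actual code or API; all names are invented): each ngram is stored only as a small fingerprint with a quantised log-probability, and a lookup probes at most three buckets:

```python
# Toy randomised ngram store: a small fingerprint and a quantised
# log-probability per ngram, probed in a handful of buckets. This is an
# illustration of the general idea, not randlm's implementation.
import hashlib

class FingerprintLM:
    def __init__(self, size=1 << 16, fp_bits=12, probes=3):
        # probes must be <= 3 with this md5-slicing scheme
        self.size, self.fp_bits, self.probes = size, fp_bits, probes
        self.slots = [None] * size  # each slot: (fingerprint, quantised logprob)

    def _hashes(self, ngram):
        """Derive one fingerprint and `probes` bucket indices from md5."""
        h = hashlib.md5(" ".join(ngram).encode("utf-8")).hexdigest()
        fingerprint = int(h[:8], 16) & ((1 << self.fp_bits) - 1)
        buckets = [int(h[8 * (i + 1):8 * (i + 2)], 16) % self.size
                   for i in range(self.probes)]
        return fingerprint, buckets

    def insert(self, ngram, logprob):
        fingerprint, buckets = self._hashes(ngram)
        quantised = max(0, min(31, int(-logprob)))  # crude 5-bit quantisation
        for b in buckets:
            if self.slots[b] is None:
                self.slots[b] = (fingerprint, quantised)
                return True
        return False  # all probed buckets taken; a real store would rebuild

    def lookup(self, ngram):
        """Return a de-quantised log-prob, or None if (almost certainly)
        absent; false positives occur with probability about 2**-fp_bits."""
        fingerprint, buckets = self._hashes(ngram)
        for b in buckets:
            slot = self.slots[b]
            if slot is not None and slot[0] == fingerprint:
                return -float(slot[1])
        return None
```

Since most MT requests are for ngrams that do not exist, the common case is a few cheap probes ending in None, which is why spotting bogus requests before they cross the network pays off.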
