> 1. I agree that I might not have to use any fancy smoothing, but even
> at Google scale using simple smoothing seems to aid performance (at
> least for machine translation):
> http://acl.ldc.upenn.edu/D/D07/D07-1090.pdf

I said "fancy smoothing", not no smoothing.  we actually do on-the-fly
Witten-Bell smoothing (and sometimes Stupid-Backoff, which is what you
can do with large LMs).

For smaller LMs we use Kneser-Ney.
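
For reference, Stupid Backoff (introduced in the Brants et al. paper
linked above) is simple enough to sketch.  A minimal Python
illustration, assuming a dict of raw ngram counts keyed by tuple plus
the corpus size (both names are mine, and the counts are assumed
consistent, ie every context of a seen ngram is also counted):

ALPHA = 0.4  # fixed backoff weight; scores are not normalised probabilities

def stupid_backoff(counts, total, ngram):
    # Relative frequency where the ngram was seen, otherwise a
    # constant discount times the score of the shortened ngram.
    if len(ngram) == 1:
        return counts.get(ngram, 0) / total  # unigram base case
    if counts.get(ngram, 0) > 0:
        return counts[ngram] / counts[ngram[:-1]]
    return ALPHA * stupid_backoff(counts, total, ngram[1:])

Skipping normalisation is exactly what makes this workable over
web-scale counts.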

> 2. Is your code open source?

My ngram code hasn't been released, but it is not hard to write
yourself: collecting ngrams and their counts is really a
generalisation of the standard word-counting problem.  (To make it
more efficient you would need to do in-mapper combining; see the
sketch below.)
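
To make the word-count analogy concrete, here is a rough Python sketch
of the two phases (not my actual code; emit() and the function names
are just illustrative):

from collections import defaultdict

N = 5  # maximum ngram order

def map_document(tokens, emit):
    # In-mapper combining: aggregate counts locally and emit one
    # record per distinct ngram rather than one per occurrence.
    local = defaultdict(int)
    for order in range(1, N + 1):
        for i in range(len(tokens) - order + 1):
            local[tuple(tokens[i:i + order])] += 1
    for ngram, count in local.items():
        emit(ngram, count)

def reduce_ngram(ngram, partial_counts, emit):
    # Identical in shape to word count: sum the partial counts.
    emit(ngram, sum(partial_counts))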

One thing I have been meaning to do is deal with skewed sharding.
Basically, high-frequency function words tend to get sent to the same
shard, which leaves the reduce phase poorly balanced.  (To fix this,
you key on the first two words of each ngram rather than just the
first; a sketch follows.)
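
A sketch of that sharding key (num_shards is hypothetical):

import hashlib

def shard_for(ngram, num_shards):
    # Key on the first two words, so ngrams that share one frequent
    # function word ("the", "of", ...) still spread across shards.
    # Use a stable hash: Python's built-in hash() is randomised per
    # process, which would break sharding across machines.
    key = " ".join(ngram[:2]).encode("utf-8")
    return int.from_bytes(hashlib.md5(key).digest()[:8], "big") % num_shards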

> 3. I was also looking to understand if there were any efforts to store
> these large sets optimally for real-time access.  Can you please point
> me to the effort on hosting LMs using Hypertable?

Currently we store very large LMs in a randomised manner.  Look here
for our SourceForge release:

https://sourceforge.net/projects/randlm/

The associated papers can be found on my homepage, under randomised
language modelling:

http://www.iccs.informatics.ed.ac.uk/~miles/mt-papers.html

The state of the art in large LMs is to use a cluster of machines (ie
some kind of BigTable setup) along with a randomised representation.
If you store fingerprints for ngrams and quantise your probabilities,
you can retrieve each gram with about three hash functions (or fewer).

Over time I have been exploring how to do this.  My first attempt used
Chord, but that didn't really work out.  We also looked at HBase
(ditto).  Right now I have a student looking at Hypertable.  He has
implemented non-blocking I/O (ie you can batch requests, send them off
and do something else) and also some tricks to spot when bogus ngram
requests are being made across the network.  It turns out that for
machine translation, the vast majority of ngram requests are for grams
that don't exist.
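
To make the fingerprint idea concrete, here is a toy Python sketch in
the spirit of our randomised stores (all names and parameters here are
illustrative; the real data structures are described in the papers
above):

import hashlib

K = 3          # candidate buckets per ngram: "about three hash functions"
FP_BITS = 12   # fingerprint width; wider means fewer false positives

def fp_and_buckets(ngram, num_buckets):
    digest = hashlib.sha256(" ".join(ngram).encode("utf-8")).digest()
    fp = int.from_bytes(digest[:2], "big") & ((1 << FP_BITS) - 1)
    return fp, [int.from_bytes(digest[2 + 4 * i:6 + 4 * i], "big") % num_buckets
                for i in range(K)]

class RandomisedStore:
    def __init__(self, num_buckets):
        self.table = [None] * num_buckets  # (fingerprint, quantised log prob)

    def insert(self, ngram, qlogprob):
        fp, buckets = fp_and_buckets(ngram, len(self.table))
        for b in buckets:
            if self.table[b] is None:
                self.table[b] = (fp, qlogprob)
                return True
        return False  # too full; a real store would resize or rebuild

    def lookup(self, ngram):
        fp, buckets = fp_and_buckets(ngram, len(self.table))
        for b in buckets:
            if self.table[b] is not None and self.table[b][0] == fp:
                return self.table[b][1]  # small chance of a false positive
        return None  # almost certainly absent

Since most lookups are for absent ngrams, lookup() usually falls
through all three probes, which is why it pays to spot bogus requests
before they ever hit the network.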


Miles
-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
