I'm happy to announce the availability of a new version of the continuous space
language model (CSLM) toolkit.

Continuous space methods were first introduced by Yoshua Bengio in 2001 [1].
The basic idea of this approach is to project the word indices onto a
continuous space and to use a probability estimator operating on this space.
Since the resulting probability functions are smooth functions of the word
representation, better generalization to unknown events can be expected.  A
neural network can be used to simultaneously learn the projection of the words
onto the continuous space and to estimate the n-gram probabilities.  This is
still an n-gram approach, but the LM probabilities are interpolated for any
possible context of length n-1 instead of backing off to shorter contexts.
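
To make the idea concrete, here is a minimal numpy sketch of such a network
(all names and sizes are illustrative assumptions, not the toolkit's actual
implementation): a shared projection matrix maps the n-1 context word indices
to continuous vectors, and a hidden layer followed by a softmax estimates the
n-gram probabilities.

    # Minimal sketch of a neural n-gram LM; sizes are illustrative only.
    import numpy as np

    V, D, H, N = 10000, 128, 256, 4     # vocabulary, projection, hidden, order
    rng = np.random.default_rng(0)

    P  = rng.normal(0, 0.1, (V, D))     # shared projection (embedding) matrix
    W1 = rng.normal(0, 0.1, ((N - 1) * D, H))
    b1 = np.zeros(H)
    W2 = rng.normal(0, 0.1, (H, V))
    b2 = np.zeros(V)

    def ngram_probs(context):
        """Return P(w | context) for all words w; context = n-1 word indices."""
        x = P[context].reshape(-1)      # project and concatenate context words
        h = np.tanh(x @ W1 + b1)        # hidden layer
        z = h @ W2 + b2                 # scores over the full vocabulary
        z -= z.max()                    # stabilize the softmax
        e = np.exp(z)
        return e / e.sum()

    probs = ngram_probs([12, 7, 431])   # P(w | w_{i-3} w_{i-2} w_{i-1})

Note that the softmax over the full vocabulary dominates the cost of such a
model; this is exactly what the short-list support described below addresses.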

CSLMs were initially used in large-vocabulary speech recognition systems and
more recently in statistical machine translation. Relative improvements in
perplexity of 10 to 20% have been reported for many languages and tasks.


This version of the CSLM toolkit is a major update of the first release. The
new features include:
 - full support for short-lists during training and inference, so that the
   CSLM can be applied to tasks with large vocabularies (a sketch of the
   short-list idea follows this list);
 - very efficient n-best list rescoring;
 - support for Nvidia graphics cards (GPUs), which speeds up training by a
   factor of four with respect to a high-end server with two CPUs.
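
Here is a hedged Python sketch of the short-list idea, roughly following the
scheme described in [2]: the network's softmax covers only the K most frequent
words; any other word falls back to a standard back-off LM, and the network
redistributes the probability mass that the back-off LM assigns to the
short-list. The nn_prob and backoff_prob functions below are uniform
placeholders standing in for the real models.

    K = 8192                            # short-list size (illustrative)
    SHORTLIST = set(range(K))           # indices of the K most frequent words

    def nn_prob(w, context):
        """Placeholder for the network's softmax over short-list words."""
        return 1.0 / K

    def backoff_prob(w, context):
        """Placeholder for a standard back-off n-gram LM lookup."""
        return 1.0 / 20000              # uniform over a 20k-word vocabulary

    def shortlist_prob(w, context):
        """Combine neural and back-off probabilities per the short-list scheme."""
        if w in SHORTLIST:
            # mass the back-off LM gives to short-list words in this context;
            # the neural network redistributes exactly this mass
            p_s = sum(backoff_prob(v, context) for v in SHORTLIST)
            return nn_prob(w, context) * p_s
        return backoff_prob(w, context) # words outside the short-list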

We have successfully trained CSLMs on large tasks such as NIST OpenMT'12;
training on one billion words takes less than 24 hours. In our experiments,
the CSLM achieves improvements in BLEU score of up to two points with respect
to a large unpruned back-off LM.
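
The rescoring step itself is conceptually a simple loop: compute the CSLM
log-probability of each hypothesis, combine it with the decoder's score, and
re-rank. Below is a minimal sketch, with an illustrative score combination
rather than the toolkit's actual n-best format:

    import math

    def sentence_logprob(words, prob_fn, order=4):
        """Sum of log P(w | previous order-1 words) over a sentence."""
        total = 0.0
        for i, w in enumerate(words):
            context = words[max(0, i - order + 1):i]
            total += math.log(prob_fn(w, context))
        return total

    def rescore(nbest, prob_fn, lm_weight=0.5):
        """nbest: list of (words, decoder_score) pairs; returns re-ranked list."""
        scored = [(words, score + lm_weight * sentence_logprob(words, prob_fn))
                  for words, score in nbest]
        return sorted(scored, key=lambda item: item[1], reverse=True)

    # e.g. rescore(nbest, shortlist_prob), reusing shortlist_prob from above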

A detailed description of the approach can be found in the following publications:

[1] Yoshua Bengio and Réjean Ducharme. A neural probabilistic language model.
    In NIPS, volume 13, pages 932-938, 2001.
[2] Holger Schwenk. Continuous space language models. Computer Speech and
    Language, volume 21, pages 492-518, 2007.
[3] Holger Schwenk. Continuous space language models for statistical machine
    translation. The Prague Bulletin of Mathematical Linguistics, number 83,
    pages 137-146, 2010.
[4] Holger Schwenk, Anthony Rousseau and Mohammed Attik. Large, pruned or
    continuous space language models on a GPU for statistical machine
    translation. In NAACL Workshop on the Future of Language Modeling,
    June 2012.


The software is available at http://www-lium.univ-lemans.fr/cslm/. It is
distributed under the GPL v3.

Comments, bug reports, requests for extensions and contributions are welcome.

enjoy,

Holger Schwenk

LIUM
University of Le Mans
holger.schw...@lium.univ-lemans.fr
