This glossary includes common terms that are
helpful for new users of statistical machine translation (SMT) and
the open source Moses Decoder project.

**aligned data** (SMT)

Aligned data are the elements of a parallel corpus consisting of two or more languages. Each element in one language matches the corresponding element in the other language(s). The elements, sometimes called segments, can be block-aligned, paragraph-aligned, sentence-aligned, phrase-aligned, or token-aligned.

**alignment process** (SMT)

There are two alignment processes. In corpus preparation, the alignment process creates aligned data. During training, the alignment process uses a program such as MGIZA++ to create word alignment files.

**BLEU score** (SMT)

BLEU stands for “Bi-Lingual Evaluation Understudy”. A BLEU score indicates how closely the token sequences in one set of data, such as machine translation output, match the token sequences in another set of data, such as a reference human translation. See “evaluation process”.
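
For illustration, here is a minimal sentence-level BLEU sketch in Python using the NLTK library (NLTK and the sample sentences are assumptions here; Moses itself ships a corpus-level scoring script, multi-bleu.perl):

```python
# A toy BLEU check with NLTK (assumes: pip install nltk).
# Real SMT evaluation is corpus-level over thousands of sentences.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the house is big".split()    # tokenized human reference translation
hypothesis = "the big house is".split()   # tokenized machine translation output

# sentence_bleu accepts multiple references; smoothing avoids zero scores
# when short sentences miss higher-order n-gram matches.
score = sentence_bleu([reference], hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```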

**corpus preparation** (SMT)

Corpus preparation is the general process of extracting, transforming, and categorizing various documents from their original form, and aligning the resulting data into a parallel corpus for training a translation model.

**development (dev) set** (SMT)

See “tuning set”.

**eval set** (SMT)

See “test set”.

**evaluation process** (SMT)

The evaluation process uses a translation model, built from components created in the training process and configured in the tuning process, to translate several thousand source language sentences in the eval set. The process then compares the resulting machine translations to reference translations, also in the eval set. The final BLEU score evaluation report shows how well the machine translations match the reference translations.

**hierarchical model** (SMT)

An SMT translation model that uses a hierarchical training corpus.

**hierarchical training data** (SMT)

A training corpus with each phrase annotated with the hierarchical structure of the language, such as parts of speech, word function, etc.

**language model** (SMT)

A “language model” or “lm” is a statistical description of one language that includes the frequencies of token-based n-gram occurrences in a corpus. The lm is trained from a large monolingual corpus and saved as a file. The language model file is a required component of every translation model. Moses uses the language model to select the most probable target language sentence from the large set of possible translations it generates using the phrase table and reordering table.
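
As a rough illustration of what a language model captures, here is a toy bigram frequency count in Python (the two-line corpus is made up; real toolkits add smoothing and back-off and write the result to a file):

```python
# A toy sketch of what a language model stores: maximum-likelihood bigram
# probabilities counted from a tiny made-up corpus. Real toolkits such as
# KenLM, SRILM, and IRSTLM add smoothing and back-off; this toy even
# ignores sentence boundaries.
from collections import Counter

corpus = ["the house is big", "the house is small"]
tokens = [t for line in corpus for t in line.split()]

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def p(word, prev):
    """P(word | prev) estimated by relative frequency."""
    return bigrams[(prev, word)] / unigrams[prev]

print(p("is", "house"))  # 1.0: "house" is always followed by "is"
print(p("big", "is"))    # 0.5: "is" is followed by "big" half the time
```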

**language model types** (SMT)

Language model files contain statistical data generated by one of various programs. The Moses Decoder can use several language model file types, including KenLM, SRILM, RandLM, and IRSTLM. The SRILM, RandLM, and IRSTLM toolkits include tools that train new language model files. KenLM, however, only reads ARPA-standard language model files, which can be created by SRILM or IRSTLM. DoMY can create language models in all these formats and configure the Moses Decoder to use any of them.

**moses configuration file** (SMT)

The moses configuration file is a text file created during the tuning process. The file contains the paths to the phrase table(s), reordering table, and language model(s), along with other codes and numeric values that control how the Moses Decoder works.
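
As a sketch, the file can be read as named sections, each holding one or more value lines. This minimal Python reader assumes the classic `[section]` layout; the file name and layout details are assumptions:

```python
# A minimal sketch that reads a moses configuration file as named sections,
# assuming the classic "[section]" header followed by value lines.
def read_moses_ini(path):
    sections, current = {}, None
    with open(path, encoding="utf-8") as f:
        for raw in f:
            line = raw.strip()
            if not line or line.startswith("#"):
                continue  # skip blanks and comments
            if line.startswith("[") and line.endswith("]"):
                current = line[1:-1]            # new section name
                sections[current] = []
            elif current is not None:
                sections[current].append(line)  # value line under the section
    return sections

for name, values in read_moses_ini("moses.ini").items():
    print(name, values)
```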

**n-grams** (SMT)

An n-gram is a subsequence of n (1, 2, 3, etc.) items in a larger sequence. In a language model, n-grams are sequences of tokens. In phrase tables and reordering tables, n-grams are sequences of pairs of source and target language tokens.
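
For example, a minimal Python extraction of token n-grams from one tokenized sentence (the sample sentence is illustrative):

```python
# A minimal sketch of extracting token n-grams from a tokenized sentence.
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the house is big".split()
print(ngrams(tokens, 1))  # unigrams: ('the',), ('house',), ('is',), ('big',)
print(ngrams(tokens, 2))  # bigrams:  ('the', 'house'), ('house', 'is'), ('is', 'big')
```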

**parallel corpus** (SMT)

See “parallel data”.

**parallel data** (SMT)

A linguistic corpus of two or more languages where each element in one language corresponds to an element with the same meaning in the other language(s). The original, authored language is identified as the source language. Non-source languages are referred to as “target” languages. For Moses SMT, parallel data takes the form of one source and one target language text file, where both files contain corresponding translations of sentences line by line.
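
As a sketch of that line-by-line correspondence, here is how the two files pair up in Python (the file names `corpus.en` and `corpus.fr` are placeholders):

```python
# A minimal sketch of Moses-style parallel data: line i of the source file
# corresponds to line i of the target file.
with open("corpus.en", encoding="utf-8") as src, \
     open("corpus.fr", encoding="utf-8") as tgt:
    for src_line, tgt_line in zip(src, tgt):
        print(src_line.strip(), "=>", tgt_line.strip())
```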

**phrase table** (SMT)

A “phrase table” is a statistical description of a parallel corpus of source-target language sentence pairs. The frequencies with which n-grams in a source language text co-occur with n-grams in a parallel target language text represent the probability that those source-target paired n-grams will occur again in other texts similar to the parallel corpus. In practical terms, the phrase table is a file created during the training process and saved in the translation model folder. It functions as a sophisticated dictionary between the source and target languages. Phrase tables and reordering tables are translation model components.
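
In the Moses text format, each phrase-table line separates its fields with ` ||| `. A minimal parsing sketch (the sample line and its scores are made-up placeholders):

```python
# A minimal sketch of parsing one phrase-table line. The " ||| " separator
# is the Moses convention; real tables may carry extra fields
# (e.g. alignments, counts).
line = "das Haus ||| the house ||| 0.8 0.5 0.7 0.6"
source, target, scores = line.split(" ||| ")
print(source)                              # das Haus
print(target)                              # the house
print([float(s) for s in scores.split()])  # [0.8, 0.5, 0.7, 0.6]
```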

**pipeline** (SMT)

A “pipeline” is a toolchain of processes connected by standard streams, so that the output of each process (stdout) feeds directly as input (stdin) to the next one.
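
A minimal sketch of the idea in Python, chaining two Unix commands so the first process's stdout feeds the second process's stdin (Unix-only; the commands are illustrative):

```python
# A minimal two-stage pipeline: p1's stdout is wired to p2's stdin.
# Shell equivalent: echo "The House." | tr '[:upper:]' '[:lower:]'
import subprocess

p1 = subprocess.Popen(["echo", "The House."], stdout=subprocess.PIPE)
p2 = subprocess.Popen(["tr", "[:upper:]", "[:lower:]"],
                      stdin=p1.stdout, stdout=subprocess.PIPE)
p1.stdout.close()  # lets p1 receive SIGPIPE if p2 exits early
print(p2.communicate()[0].decode())  # "the house."
```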

**recaser model** (SMT)

A recaser model is a special translation model that translates lowercased data into “natural” cased text (mixed upper and lower casing).

**reordering table** (SMT)

A “reordering table” contains the statistical frequencies that describe the changes in word order between source and target languages, such as “big house” versus “house big”. In practical terms, a reordering table is a file created during the training process and saved in the model folder. The reordering table is a translation model component.

**source language** (SMT)

The source language is the language of the text that is to be translated. Typically, this is the authored language of the text. The source language is the same as the TMX specification's `srclang` attribute of the `<tu>` tag.

**target language** (SMT)

The target language is the language into which the source language text is to be translated.

**test set** (SMT)

A pair of source and target language data sets, typically containing several thousand sentence pairs, used in the evaluation process.

**tokenization** (SMT)

Tokenization is the process of separating words from punctuation and symbols into tokens.
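
A minimal regex sketch of the idea in Python (real Moses tokenization, e.g. tokenizer.perl, applies many language-specific rules this toy ignores):

```python
# A minimal regex tokenizer: keep runs of letters/digits together and
# split each punctuation mark off as its own token.
import re

def tokenize(text):
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("It's a big house."))
# ['It', "'", 's', 'a', 'big', 'house', '.']
```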

**tokens** (SMT)

Tokens are the basic units in a machine translation process. A token is a sequence of characters, such as a word, punctuation mark, or symbol, separated by spaces. See “BLEU score”.

**toolchain** (SMT)

A “toolchain” is a series of linked or “chained” programming tools used in sequence, where the output of an upstream tool becomes the input for a “downstream” tool.

**training corpus** (SMT)

A linguistic corpus of parallel data prepared for training into the phrase table and reordering table components of a translation model.

**training data** (SMT)

See “training corpus”.

**training process** (SMT)

Training is a process from the machine learning branch of the artificial intelligence field. In the training process, a system “learns” the relationships between the two sides of the parallel data. In SMT, the source language texts are stimuli that generate the target language texts as a response. In practical terms, training starts with the bitext files and creates the phrase table and reordering table that are components of a translation model.

**translation memory** (SMT)

A translation memory (tm) is parallel data that was collected for the purpose of aiding future translations.

**translation model** (SMT)

A “translation model” consists of one or more phrase tables, zero or more reordering tables, one or more language models, and one moses configuration file that were created during the training and tuning processes.

**tuning process** (SMT)

Tuning is a process that finds the optimized configuration file settings for a translation model when it is used for a specific purpose. The tuning process translates thousands of source language phrases in the tuning set with a translation model, compares the model's output to a set of reference human translations, and adjusts the settings with the intention of improving the translation quality. The process repeats these steps through numerous iterations until it reaches an optimized translation quality.

**tuning set** (SMT)

A pair of source and target language data sets, typically containing several thousand sentence pairs, used in the tuning process.

**word aligner** (SMT)

A word aligner is a program that creates word alignment files during the word alignment process. Moses currently supports these word aligners: GIZA++, MGIZA++, and BerkeleyAligner. DoMY uses MGIZA++ by default.

**word alignment** (SMT)

The word alignment process uses a word aligner to create a word alignment file during the training process.

**words** (SMT)

A word is the smallest unit of meaning in a language that can stand on its own. In SMT, a word is a token created in the tokenization process that is not punctuation or a symbol.