Hi Taylor,
The moses-support team does not support DoMY. I'll answer your
specific DoMY questions separately, but I can share some general
thoughts here.
Fully automating the end-to-end process from .tmx files to a trained
translation model is a challenging task. It's often necessary to insert
break points where localization engineers and linguists can review the
extracted data and identify any inherited corruption from the .tmx data.
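As one illustration of such a break point, a reviewer could run a quick
sanity check over an extracted file pair before continuing. The sketch
below is not part of DoMY. It assumes the extracted corpus is a pair of
plain-text files with one segment per line; the function name
find_suspect_pairs and the max_ratio threshold are hypothetical.

```python
from itertools import zip_longest


def find_suspect_pairs(src_lines, tgt_lines, max_ratio=9.0):
    """Return (line_number, reason) for pairs a reviewer should inspect.

    Assumes one segment per line on each side (an assumption, not a
    documented DoMY format). max_ratio is an arbitrary default.
    """
    suspects = []
    pairs = zip_longest(src_lines, tgt_lines)
    for n, (src, tgt) in enumerate(pairs, start=1):
        # One file is longer than the other: the alignment is broken here.
        if src is None or tgt is None:
            suspects.append((n, "missing counterpart"))
            continue
        s, t = src.split(), tgt.split()
        if not s or not t:
            suspects.append((n, "empty segment"))
        elif max(len(s), len(t)) / min(len(s), len(t)) > max_ratio:
            suspects.append((n, "length ratio"))
    return suspects
```

Passing two open file objects works as well as two lists, since both
iterate line by line. An empty result does not prove the data is clean;
it only means these three cheap checks found nothing.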
In general, the DoMY steps to go from .tmx to a trained engine would
include running the following "graphs" in the order below. Note: the
term "graph" refers to a parallel toolchain or pipeline that
synchronizes (aligns) data between two or more languages. It comes from
the multimedia term "filter graph", as used by Linux's GStreamer and
Microsoft's DirectShow, which work on parallel synchronized media
streams.
1) domy import-tmx - extracts tmx data to parallel corpora files
(Python)
2) domy clean-corpus - cleans parallel data similar to Moses'
clean-corpus-n.perl. Adds extraction of language model data (Python)
3) domy build-lm - consolidates individual corpus files to master
language model and recaser corpus files (Python)
4) domy build-tm - consolidates individual corpus files to two master
parallel files plus supporting dev/eval sets and .sgm files (Python)
5) train - wrapper for the following sequential steps (Bash scripts)
   a) train-lm - trains language model from corpus in (3)
   b) train-tables - trains phrase and reorder tables from corpus in
      (4)
   c) train-tablesbin - binarizes tables from (5b)
   d) train-recaser - trains recaser model from corpus in (3)
   e) train-mert - tunes a translation model consisting of the LM from
      (5a) and tables from (5c)
   f) train-eval - translates the eval sets from (4) and scores the
      output with mteval-v12.pl
6) domy translate - translates new documents using the engine created
above (Python)
You need to edit the config.ini files for steps (1-4) and also issue a
proper command line for (5). Renaming directories should not be
necessary if the config.ini files are set up properly.
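To make the ordering and the review break points concrete, here is a
minimal driver sketch. It is not a DoMY tool. The step names come from
the list above, but every command line in STEPS is a placeholder (I am
not showing real DoMY or Moses invocations here), and which steps pause
for review is only a suggestion.

```python
import subprocess

# (step name, placeholder command, pause for human review afterwards?)
# The commands are hypothetical stand-ins; substitute your own.
STEPS = [
    ("import-tmx", ["echo", "run import-tmx graph"], True),
    ("clean-corpus", ["echo", "run clean-corpus graph"], True),
    ("build-lm", ["echo", "run build-lm graph"], False),
    ("build-tm", ["echo", "run build-tm graph"], False),
    ("train", ["echo", "run train wrapper"], False),
]


def run_pipeline(steps, run=subprocess.check_call, confirm=input):
    """Run steps in order and return the names of the steps that ran.

    check_call raises on a non-zero exit, so a failed step stops the
    pipeline. At a review step, anything other than 'y' stops it too.
    """
    done = []
    for name, command, review in steps:
        run(command)
        done.append(name)
        if review:
            answer = confirm(f"Reviewed output of {name}? [y/N] ")
            if answer.strip().lower() != "y":
                break
    return done
```

Calling run_pipeline(STEPS) runs each step and waits for a reviewer
after import-tmx and clean-corpus. The run and confirm parameters exist
only so the driver can be exercised without executing anything.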
If you need help, I'll be happy to take that offline from
moses-support.
Tom
On Mon, 12 Sep 2011 10:30:30 -0400, Taylor Rose
<[email protected]> wrote:
> Hey all,
>
> I've been working with Domy for about a week and I'm trying to
> automate the process of going from a *.tmx to a trained translation
> module.
>
> This is my understanding of the sequence so far:
> import-tmx
> rename directories (ie. en/en/data.txt en/nl/data.txt)
> clean-corpus
> sa-champollion to align
> build-tm
> build-lm
> train-lm
> ready to translate?
>
> Is my understanding of this correct? I'd also appreciate help with
> formatting output of graphs. the import-tmx graph outputs a directory
> structure such as '/Test/tm/us_gb/us_gb/nl_nl' but the clean-corpus
> graph expects a structure such as '/Test/tm/en/en/nl'. Is there a way
> to modify the output in the config.ini file or should I just write a
> bash script to rename everything?
>
> Thanks,
>
>
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support