Hello,

I'm trying to build a system where I can take a general speech corpus
for a given language pair and add specific information to it for certain
applications. For instance, I may want to add computer technology or
biology terms to my general corpus. In this way I hope to be able to
create domain-specific translation modules quickly. My current system
for doing this is quite slow. I have some ideas about how to improve it,
but I'm not sure if they're possible.

My current system works like this: I keep a large generalized corpus
stored on my server. When I want to create a new translation module, I
combine a smaller corpus with my large corpus and then go through the
entire cleaning and training process with Moses. This ends up taking a
very long time when I'm really only adding a small amount of new
information.

What I would like to do is train on the large corpus on its own and
then update the resulting model with the smaller domain-specific
corpora. I think this would save me a huge amount of time, since I would
only ever have to train on the large corpus once. Is this sort of thing
at all possible? If Moses keeps track of instance counts of n-grams, I
think this would be trivial. I'm just not sure if it actually does that.
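
As a rough illustration of why stored counts would make this easy, here
is a toy sketch in Python. It is not Moses code and it does not show
whether Moses actually retains such counts; the function names, the
phrase pairs, and the simple relative-frequency estimate are all
assumptions made up for the example. The point is only that if raw
counts are kept, counts from a small domain corpus can be added to the
general counts and the probabilities recomputed, without recounting the
large corpus.

```python
from collections import Counter


def count_pairs(pairs):
    """Count (source, target) phrase pairs and per-source totals."""
    pair_counts = Counter(pairs)
    source_counts = Counter(src for src, _ in pairs)
    return pair_counts, source_counts


def merge_counts(base, extra):
    """Add the counts from a small corpus to previously stored counts."""
    return base[0] + extra[0], base[1] + extra[1]


def translation_prob(counts, src, tgt):
    """Relative-frequency estimate p(tgt | src) from raw counts."""
    pair_counts, source_counts = counts
    total = source_counts[src]
    return pair_counts[(src, tgt)] / total if total else 0.0


# Hypothetical data: counts from the large general corpus, computed once.
general_pairs = [("cell", "cellule"), ("cell", "cellule"), ("cell", "portable")]
general = count_pairs(general_pairs)

# Hypothetical data: counts from a small biology corpus, computed later.
biology_pairs = [("cell", "cellule"), ("membrane", "membrane")]
biology = count_pairs(biology_pairs)

# Merging only touches the stored counts, not the general corpus itself.
combined = merge_counts(general, biology)

print(translation_prob(general, "cell", "cellule"))
print(translation_prob(combined, "cell", "cellule"))
```

In this toy setup, merging the two sets of counts gives the same result
as counting the concatenated corpus from scratch, which is the property
I am hoping the real trained model has.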

Thanks for any help or advice you can provide,
-- 
Taylor Rose
Machine Translation Intern
Language Intelligence
IRC: Handle: trose
     Server: freenode



_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
