Dear Taylor,

To my knowledge, it is difficult for Moses to do this kind of work out of the box.
Let me restate your setup as follows: 1) corpus A (a big one in the general domain) and corpora B, C, D, etc. (small ones in specific domains); 2) you want to use A+B, A+C, and so on to train domain-adapted systems.

Generally, there are several steps from a raw corpus to a final SMT engine: 1) data cleaning, 2) word alignment, 3) phrase extraction, 4) probability calculation, and so on.

For the first step, it is easy to combine the outputs in the same way as the inputs: cleaning each corpus separately and concatenating the results is equivalent to cleaning the concatenation. But for the other steps, I do not think there is an easy way to combine the results. For example, in step 2, GIZA++ needs to go over all the training sentences to find possible alignments; if the corpus is split into two parts that are trained separately, the alignments may not be as good as those from the combined corpus. The situation is similar for step 4: the phrase probabilities are relative frequencies over the whole training data, so merging two trained models would require the underlying counts, not just the final probabilities. Doing this properly would take considerable extra effort, so retraining on the whole corpus A+B is probably the easiest option.

Please let me know if you find a better method. Thanks.

On Tue, Mar 20, 2012 at 1:37 AM, Taylor Rose <[email protected]> wrote:
> Hello,
>
> I'm trying to build a system where I can take a general speech corpus
> for a given language pair and add specific information to it for certain
> applications. For instance, I may want to add information about computer
> technologies or biology terms to my general corpus. In this way I hope
> to be able to easily create domain-specific translation modules quickly.
> My current system for doing this is quite slow and I have some ideas
> about how to improve it but I'm not sure if they're possible.
>
> My current system is as such: I keep a large generalized corpus stored
> on my server. When I want to create a new translation module I combine a
> smaller corpus with my large corpus and then go through the entire
> cleaning and training process with Moses. This ends up taking a very
> long time when I'm really only adding a small amount of new information.
>
> What I would like to do is to train the large corpus on its own and
> then modify it with the smaller domain-specific corpora. I think this
> would save me a huge amount of time since I would only ever have to
> train the large corpus once. Is this sort of thing at all possible? If
> Moses keeps track of instance counts of n-grams I think this would be
> trivial. I'm just not sure if it actually does that.
>
> Thanks for any help or advice you can provide,
> --
> Taylor Rose
> Machine Translation Intern
> Language Intelligence
> IRC: Handle: trose
> Server: freenode
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
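To make the point about step 4 above concrete, here is a toy sketch (plain Python with illustrative phrase pairs, not Moses code or its phrase-table format) of why merging two separately trained models requires the underlying counts rather than the final probabilities:

```python
# Toy illustration: Moses-style phrase probabilities are relative
# frequencies, P(target | source) = count(source, target) / count(source).
# To combine corpora A and B correctly, merge the raw counts and
# recompute the ratios; averaging the two models' probabilities gives a
# different (wrong) answer. All data below is made up for illustration.
from collections import Counter

def extract_counts(phrase_pairs):
    """Count (source, target) phrase pairs and source-phrase occurrences."""
    pair_counts = Counter(phrase_pairs)
    src_counts = Counter(src for src, _ in phrase_pairs)
    return pair_counts, src_counts

def probabilities(pair_counts, src_counts):
    """P(target | source) = count(source, target) / count(source)."""
    return {pair: c / src_counts[pair[0]] for pair, c in pair_counts.items()}

# Toy phrase pairs from corpus A (general) and corpus B (domain-specific).
corpus_a = [("house", "Haus"), ("house", "Haus"), ("house", "Heim")]
corpus_b = [("house", "Heim"), ("house", "Heim")]

pa, sa = extract_counts(corpus_a)
pb, sb = extract_counts(corpus_b)

# Correct: merge the counts, then recompute probabilities over A+B.
merged = probabilities(pa + pb, sa + sb)
# P("Heim" | "house") = 3/5 = 0.6

# Wrong: averaging the two models' probabilities,
# (1/3 + 2/2) / 2 = 0.667, does not equal the A+B estimate of 0.6.
```

This is exactly the "instance counts of n-grams" issue Taylor raised: if only the probabilities are stored, the counts needed for a correct merge are gone, which is why retraining on the concatenated corpus is the straightforward path.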
