Lane Schwartz <dowobeha@...> writes:
>
> I have a number of distinct monolingual corpora. I've been training them
> as separate LMs. I now want to run a variant where they are all concatenated
> together, and then trained as a single LM. The EMS walkthrough says this
> should be possible
> (http://www.statmt.org/moses/?n=FactoredTraining.EMS#ntoc19), but doesn't
> give the requisite syntax. What is the EMS syntax to do this?
>
> Thanks,
> Lane
Hi Lane,
I tried to solve the problem quickly on Monday, but that didn't turn out
too well (see the next few commits fixing bugs with it). I was also unhappy
that I couldn't have multiple CONCATENATED-LMs on the same corpus, or define
which corpora to concatenate. This implementation solves that. Assume you
have these two LMs defined:
[LM:parallelA]
raw-corpus = /some/path
[LM:parallelB]
raw-corpus = /some/path
order = 5
We can have a second LM trained on the data of parallelA, but with different
settings, like this:
[LM:parallelA2]
stripped-corpus = [LM:parallelA:stripped-corpus]
exclude-from-interpolation = true
order = 6
[this was actually possible before, but I've added the property
'exclude-from-interpolation', which tells INTERPOLATED-LM to skip this LM.]
If you want an LM on concatenated data, you can define it like this:
[LM:parallelAB]
concatenate-files = [LM:{parallelA,parallelB}:stripped-corpus]
exclude-from-interpolation = true
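The brace syntax in 'concatenate-files' can be read as shorthand for a list of ordinary step references, one per named corpus. The Python sketch below only illustrates that reading; expand_reference is a hypothetical helper written for this example, not part of EMS, and the actual expansion logic in EMS may differ.

```python
import re


def expand_reference(ref: str) -> list[str]:
    """Expand '[LM:{a,b}:step]' into ['[LM:a:step]', '[LM:b:step]'].

    Hypothetical helper for illustration only; not EMS code.
    References without a brace group are returned unchanged.
    """
    match = re.fullmatch(r"\[([^:\]]+):\{([^}]*)\}:([^\]]+)\]", ref)
    if match is None:
        return [ref]
    module, names, step = match.groups()
    # One plain reference per comma-separated corpus name.
    return [f"[{module}:{name.strip()}:{step}]" for name in names.split(",")]


print(expand_reference("[LM:{parallelA,parallelB}:stripped-corpus]"))
```

Under this reading, the parallelAB model above is trained on the stripped corpora of parallelA and parallelB joined into one file.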
Finally, you can also use 'custom-training' to train a language model that
train-model.perl doesn't know about, like NPLM. You'll also have to define
how the model should be added to the moses.ini:
[LM:parallelAB]
stripped-corpus = [LM:parallelAB:stripped-corpus]
custom-training = "my_training_script.sh -order 5 -some_setting 8"
config-feature-line = "NPLM path=/some/path order=5 some-setting=8"
config-weight-line = "NPLM0= 0.1"