Hi, I'd say non-factored PBMT is a strong enough baseline already. Depending on the size of the data you have, tFaT-FaT (to use my notation from the 'taxonomy' paper) should work slightly better. Note that this setup is actually non-factored again, except for using two target-side LMs.
The combinatorial explosion is harmless only in very small data settings. I'm curious to see your results, but let's take that off-line.

O.

On 06/10/2012 10:53 PM, Marcin Junczys-Dowmunt wrote:
> Hi Ondrej,
> thank you for the literature pointers. I am trying to build as good a
> baseline system as possible for English-to-Polish SMT using factors.
> So far it has been rather frustrating, but from what I have seen in the
> literature that's not really surprising. Later I hope to come up with
> something clever to improve upon that.
>
> Among the work you have dedicated to the problem of Czech morphology in
> English-to-Czech MT, which system would you recommend for such a baseline?
> Best,
> Marcin
>
> On 10.06.2012 22:16, Ondrej Bojar wrote:
>> Hi, Marcin,
>>
>> yes, the root of the trouble is that all possibilities are multiplied.
>> Cube pruning can be considered "just a clever speedup" (a very clever
>> one, of course), but I think implementing a similar thing would not be
>> very useful here. This of course depends on what you actually use the
>> factors for, but if you use them for morphology or anything that has
>> to get support for a particular choice in the context, you can't avoid
>> the problem.
>>
>> Consider a noun: using a few factored steps, you can easily produce
>> translation options for all potential cases. But without the context
>> of the preceding verb, preposition or e.g. adjective, you can't pick
>> the correct one. So pruning the translation options for the noun is
>> likely to prevent you from getting the agreement right. I've run into
>> this issue a few times already (most recently this year,
>> http://aclweb.org/anthology-new/W/W12/W12-3130.pdf) and I've tried
>> circumventing it using a two-step approach, which postpones the
>> morphological explosion to a separate search (where the lemmas are
>> already chosen).
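To make the agreement problem above concrete, here is a toy sketch (not Moses code; the word forms, scores, required form and pruning limit are all invented for illustration). Ranking a noun's case variants by their local, context-free scores and then pruning can discard exactly the form that agreement with a later-chosen preposition would require:

```python
# Hypothetical local scores for case variants of the Czech noun "hrad"
# (castle); the most frequent form wins without context.
options = {
    "hrad":   0.50,  # nominative/accusative
    "hradu":  0.30,  # genitive/dative
    "hradem": 0.15,  # instrumental
    "hrade":  0.05,  # vocative/locative
}

def prune(opts, limit):
    """Keep only the `limit` highest-scoring options (early pruning)."""
    return dict(sorted(opts.items(), key=lambda kv: -kv[1])[:limit])

kept = prune(options, 2)

# The preposition "s" (with) would require the instrumental "hradem",
# but that decision is only made later, during search.
required = "hradem"
print(required in kept)  # the agreeing form was already pruned away
```

The search can no longer recover `hradem` once the translation options for the noun have been fixed, which is the point of the two-step workaround.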
>> Needless to say, Alex Fraser (in the follow-up work of
>> http://www.statmt.org/wmt09/pdf/WMT-0920.pdf) was somewhat more
>> successful.
>>
>> So you don't want to just limit the number of options; what you
>> actually want is to select the good ones...
>>
>> O.
>>
>> On 06/10/2012 08:21 PM, Marcin Junczys-Dowmunt wrote:
>>> Hi Ondrej,
>>> The blow-up is happening in "DecodeStepGeneration::Process(...)",
>>> right? If I understand the code correctly from a first glance, all
>>> possibilities are simply multiplied. And indeed, there seems to be no
>>> way to limit the number of combinations in this step. Could something
>>> like cube pruning work here to limit the number of options right from
>>> the beginning?
>>> Best,
>>> Marcin
>>>
>>> On 10.06.2012 19:02, Ondrej Bojar wrote:
>>>> Dear Marcin,
>>>>
>>>> the short answer is: you need to avoid the blow-up.
>>>>
>>>> The options that affect pruning during the creation of translation
>>>> options are:
>>>>
>>>> -ttable-limit ...how many variants of a phrase to read from the
>>>> phrase table
>>>>
>>>> -max-partial-trans-opt ...how many partial translation options are
>>>> considered for a span. This is the critical pruning to contain the
>>>> blow-up in memory.
>>>>
>>>> -max-trans-opt-per-coverage ...how many finished options should then
>>>> be passed to the search.
>>>>
>>>> -translation-option-threshold ...the same thing, but expressed
>>>> relative to the score of the best one.
>>>>
>>>> If you set up the model so that it does blow up, but avoid thrashing
>>>> your machine by setting -max-partial-trans-opt reasonably low, you
>>>> are very likely to get a lot of search errors, because the pruning of
>>>> translation options happens too early, without the linear context of
>>>> the surrounding translation options. Moses simply does not have good
>>>> means to handle the combinatorics of factored models.
>>>>
>>>> Cheers, Ondrej.
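For reference, the pruning switches listed above can also be set in the decoder's configuration file. The excerpt below is hypothetical: the section names are assumed to follow the long command-line option names, and the numbers are purely illustrative, not recommended defaults:

```ini
; Hypothetical moses.ini excerpt (illustrative values only)

[ttable-limit]
20

[max-partial-trans-opt]
5000

[max-trans-opt-per-coverage]
50

[translation-option-threshold]
0.5
```

As the discussion above notes, lowering -max-partial-trans-opt trades memory for search errors, so these values have to be tuned per setup rather than copied.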
>>>>
>>>> On 06/10/2012 06:40 PM, Marcin Junczys-Dowmunt wrote:
>>>>> Hi,
>>>>> by the way, are there some best-practice decoder settings for
>>>>> heavily factored models with combinatorial blow-up? If I am not
>>>>> mistaken, most settings affect hypothesis recombination later on.
>>>>> Here the heavy work happens during the creation of target phrases
>>>>> and the future score calculation, before the actual translation.
>>>>> Best,
>>>>> Marcin
>>>>>
>>>>> On 09.06.2012 16:45, Philipp Koehn wrote:
>>>>>> Hi,
>>>>>>
>>>>>> the idea here was to create a link between the
>>>>>> words and POS tags early on and use this as
>>>>>> an additional scoring function. But if you see better
>>>>>> performance with your setting, please report back.
>>>>>>
>>>>>> -phi
>>>>>>
>>>>>> On Fri, Jun 8, 2012 at 6:03 PM, Marcin Junczys-Dowmunt
>>>>>> <[email protected]> wrote:
>>>>>>> Hi all,
>>>>>>> I have a question concerning the "Tutorial for Using Factored
>>>>>>> Models", section on "Train a morphological analysis and
>>>>>>> generation model".
>>>>>>>
>>>>>>> The following translation factors and generation factors are
>>>>>>> trained for the given example corpus:
>>>>>>>
>>>>>>> --translation-factors 1-1+3-2 \
>>>>>>> --generation-factors 1-2+1,2-0 \
>>>>>>> --decoding-steps t0,g0,t1,g1
>>>>>>>
>>>>>>> What is the advantage of using the first generation factor 1-2
>>>>>>> compared to the configuration below?
>>>>>>>
>>>>>>> --translation-factors 1-1+3-2 \
>>>>>>> --generation-factors 1,2-0 \
>>>>>>> --decoding-steps t0,t1,g1
>>>>>>>
>>>>>>> I understand the 1-2 generation factor maps lemmas to POS+morph
>>>>>>> information, but the same information is also generated by the
>>>>>>> 3-2 translation factor. Apart from that, this generation factor
>>>>>>> introduces a huge combinatorial blow-up, since every lemma can be
>>>>>>> mapped to basically every possible morphological information seen
>>>>>>> for this lemma.
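The blow-up described in the question can be sketched in a few lines (this is a toy illustration of the multiplication a generation step performs, not the actual DecodeStepGeneration code; the lemmas and tag inventories are invented):

```python
import itertools

# Hypothetical lemma -> morphological tags ever observed for that lemma.
# A generation step crosses every partial option with every tag of its
# lemma, so options multiply per word.
generation_table = {
    "dom":  ["sg:nom", "sg:gen", "sg:dat", "sg:acc", "sg:ins", "sg:loc", "sg:voc"],
    "nowy": ["sg:nom:m", "sg:gen:m", "sg:dat:m", "sg:acc:m", "sg:ins:m"],
}

lemmas = ["nowy", "dom"]                            # a two-word phrase
expansions = [generation_table[l] for l in lemmas]  # tags per position
combinations = list(itertools.product(*expansions)) # Cartesian product

print(len(combinations))  # 5 * 7 = 35 options for just two words
```

With richly inflected lemmas and longer phrases, the product grows multiplicatively per word, which is why pruning these options early (and therefore blindly with respect to context) becomes unavoidable.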
>>>>>>> _______________________________________________
>>>>>>> Moses-support mailing list
>>>>>>> [email protected]
>>>>>>> http://mailman.mit.edu/mailman/listinfo/moses-support

--
Ondrej Bojar (mailto:[email protected] / [email protected])
http://www.cuni.cz/~obo
