Hi,

On Fri, Nov 18, 2011 at 11:22 PM, Tom Hoar <[email protected]> wrote:
> Jehan,
>
> A brute-force method to give some phrases more weight is to simply create
> intentional duplicates in your training data set. Miles' option has more
> finesse.
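(For reference, the brute-force version is trivial to script. A minimal sketch, assuming one-sentence-per-line parallel files already loaded into lists, and a made-up set of pair indices to boost:)

```python
# Sketch: weight selected sentence pairs by duplicating them N times.
# The file contents and the "boost" index set are made up for illustration.

def duplicate_pairs(src_lines, tgt_lines, boost, factor=3):
    """Return new parallel lists where pairs whose index is in `boost`
    appear `factor` times instead of once (order is preserved)."""
    out_src, out_tgt = [], []
    for i, (s, t) in enumerate(zip(src_lines, tgt_lines)):
        times = factor if i in boost else 1
        out_src.extend([s] * times)
        out_tgt.extend([t] * times)
    return out_src, out_tgt

src = ["a house", "the cat"]
tgt = ["une maison", "le chat"]
# Boost the second pair threefold before writing the files back out.
new_src, new_tgt = duplicate_pairs(src, tgt, boost={1}, factor=3)
```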
Though I would definitely fall back on this if I cannot make weights work in GIZA++, I'd like to try the GIZA++ way first (unless someone tells me it is actually exactly the same thing?).

I have been trying to find the GIZA++ option for this, but documentation is definitely lacking here, and the web is not very talkative on the matter either. A grep on the GIZA++ source code suggests that only plain2snt has an option named "weight" (unless the feature goes by another name in the other tools of the archive). It apparently works like this:

---
plain2snt txt1 txt2 [txt3 txt4 -weight w]
  Converts plain text into GIZA++ snt-format
---

That alone is not enough to be sure how it works (though it gives some hints), and in particular train-model.perl never seems to call this tool (nor is it used internally by GIZA++, according to its source). Since the Moses documentation does not say to link this tool into bin/ (so I didn't), I am quite sure it is never used during the training process.

So is there something I am missing? How does one set weights on training data through GIZA++ (or MGIZA++, whose multi-threading features I am trying right now)?

Thanks.

Jehan

> Tom
>
> On Fri, 18 Nov 2011 18:29:50 +0900, Jehan Pages <[email protected]> wrote:
>>
>> Hi,
>>
>> On Fri, Nov 18, 2011 at 6:00 PM, Miles Osborne <[email protected]> wrote:
>>>
>>> re: not tuning on training data, in principle this shouldn't matter
>>> (especially if the tuning set is large and/or representative of the
>>> task).
>>>
>>> in reality, Moses will assign far too much weight to these examples,
>>> to the detriment of the others (it will drastically overfit). this
>>> is why the tuning and training sets are typically disjoint. this is a
>>> standard tactic in NLP and not just Moses.
>>
>> Ok thanks.
>> Actually, I think that is indeed what I learned years ago on the
>> topic (back when I was still at university, in fact working on this
>> kind of topic, though that is rather far away now).
>>
>> [Also, Tom Hoar: forget my questions about your answer at this point
>> (when I asked "how do you do so?" and such). I misunderstood the
>> meaning of your answer! Now, with Miles's answer, and rereading your
>> first one, I understand.]
>>
>>> re: assigning more weight to certain translations, you have two
>>> options here. the first would be to assign more weight to these pairs
>>> when you run Giza++ (you can assign per-sentence-pair weights at
>>> this stage). this is really just a hint and won't guarantee anything.
>>> the second option would be to force translations (using the XML
>>> markup).
>>
>> I see. Interesting. For what I want, the weights in GIZA++ look nice.
>> I'll try to find information on this.
>>
>> Thanks a lot for the answers.
>>
>> Jehan
>>
>>> Miles
>>>
>>> On 18 November 2011 08:42, Jehan Pages <[email protected]> wrote:
>>>>
>>>> Hi,
>>>>
>>>> On Fri, Nov 18, 2011 at 2:59 PM, Tom Hoar
>>>> <[email protected]> wrote:
>>>>>
>>>>> Jehan, here are my strategies, others may vary.
>>>>
>>>> Thanks.
>>>>
>>>>> 1/ The 100-word (token) limit is a dependency of GIZA++ and MGIZA++,
>>>>> not just a convenience for speed. If you make the effort to use the
>>>>> BerkeleyAligner, this limit disappears.
>>>>
>>>> Ok, I didn't know about this alternative to GIZA++. I see there is
>>>> some explanation on the website for switching to this aligner. I may
>>>> give it a try someday then. :-)
>>>>
>>>>> 2/ From a statistics and survey methodology point of view, your
>>>>> training data is a subset of individual samples selected from a
>>>>> whole population (linguistic domain) so as to estimate the
>>>>> characteristics of the whole population.
>>>>> So, duplicates can exist and they play an important role in
>>>>> determining statistical significance and calculating probabilities.
>>>>> Some data sources, however, repeat information with little relevance
>>>>> to the linguistic balance of the whole domain. One example is a web
>>>>> site with repetitive menus on every page. Therefore, for our use, we
>>>>> keep duplicates where we believe they represent a balanced sampling
>>>>> and the results we want to achieve. We remove them when they do not.
>>>>> Not everyone, however, agrees with this approach.
>>>>
>>>> I see, and that confirms my thoughts. I don't know for sure what my
>>>> strategy will be, but most probably it will be to keep them all.
>>>> Conditional removal like you do is interesting, but it would prove
>>>> hard to do on our platform, as we don't store context for
>>>> translations.
>>>>
>>>>> 3/ Yes, none of the data pairs in the tuning set should be present
>>>>> in your training data. To do so skews the tuning weights to give
>>>>> excellent BLEU scores on the tuning results, but horrible scores on
>>>>> "real world" translations.
>>>>
>>>> I am not sure I understand what you are saying. How would that
>>>> happen? Also, why would we get horrible scores on real-world
>>>> translations? Isn't the point exactly that the tuning data should
>>>> "represent" the real-world translations that we want to get close to?
>>>>
>>>> 4/ Also, I was wondering something else that I just remembered, so
>>>> that will be a fourth question!
>>>> Suppose that in our system we have some translations we know for
>>>> sure are very good (all are good, but some are supposed to be more
>>>> like "certified quality"). Is there no way in Moses to give more
>>>> weight to some translations, in order to bias the system towards
>>>> quality data (while still keeping all the data)?
>>>>
>>>> Thanks again!
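(As an aside, the duplicate removal discussed above is simple enough when context doesn't matter. A rough sketch, assuming aligned one-sentence-per-line files and treating an exact source+target pair as the unit of deduplication; conditional removal as Tom describes would need extra per-source logic on top of this:)

```python
# Sketch: drop exact-duplicate sentence pairs from a parallel corpus,
# keeping the first occurrence. Pairing is by line index.

def dedup_pairs(src_lines, tgt_lines):
    seen = set()
    out = []
    for pair in zip(src_lines, tgt_lines):
        if pair not in seen:
            seen.add(pair)
            out.append(pair)
    return out

# E.g. a crawled site with a repeated "Home" menu entry on every page:
pairs = dedup_pairs(
    ["Home", "About", "Home"],
    ["Accueil", "À propos", "Accueil"],
)
# the repeated ("Home", "Accueil") pair is kept only once
```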
>>>>
>>>> Jehan
>>>>
>>>>> Tom
>>>>>
>>>>>
>>>>> On Fri, 18 Nov 2011 14:31:44 +0900, Jehan Pages <[email protected]>
>>>>> wrote:
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I have a few questions about the quality of training and tuning
>>>>>> data. If anyone has any clarifications, that would be nice! :-)
>>>>>>
>>>>>> 1/ According to the documentation:
>>>>>> «
>>>>>> sentences longer than 100 words (and their corresponding
>>>>>> translations) have to be eliminated
>>>>>> (note that a shorter sentence length limit will speed up training)
>>>>>> »
>>>>>> So is it only for the sake of training speed, or can overly long
>>>>>> sentences end up being a liability for MT quality? In other words,
>>>>>> when I finally train "for real usage", should I really remove long
>>>>>> sentences?
>>>>>>
>>>>>> 2/ My data is taken from real crowd-sourced translated data. As a
>>>>>> consequence, we end up with some duplicates (same original text and
>>>>>> same translation). I wonder whether, for training, that doesn't
>>>>>> matter, or we should remove duplicates, or it is actually better to
>>>>>> have duplicates.
>>>>>>
>>>>>> I would imagine the latter (keeping duplicates) is best, as this is
>>>>>> "statistical machine learning" and, after all, these represent
>>>>>> "real life" duplicates (text we often encounter and apparently
>>>>>> usually translate the same way), so it would be good to "insist on"
>>>>>> these translations during training.
>>>>>> Am I right?
>>>>>>
>>>>>> 3/ Do training and tuning data necessarily have to be different? I
>>>>>> guess for it to be meaningful they should, and various examples on
>>>>>> the website seem to go that way, but I could not find anything
>>>>>> clearly stating this.
>>>>>>
>>>>>> Thanks.
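(On question 1 above: in practice this filtering is what Moses' clean-corpus-n.perl script does. A minimal stand-in, assuming whitespace tokenization; the real script additionally enforces a minimum length and a source/target length ratio:)

```python
# Sketch: keep only sentence pairs where BOTH sides are within a token
# limit. Only the maximum-length check of clean-corpus-n.perl is shown.

def filter_long_pairs(src_lines, tgt_lines, max_tokens=100):
    kept = []
    for s, t in zip(src_lines, tgt_lines):
        if len(s.split()) <= max_tokens and len(t.split()) <= max_tokens:
            kept.append((s, t))
    return kept

kept = filter_long_pairs(
    ["one two three", "a " * 200],  # second source side has 200 tokens
    ["un deux trois", "b"],
    max_tokens=100,
)
# only the short pair survives
```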
>>>>>>
>>>>>> Jehan
>>>>>>
>>>>>> _______________________________________________
>>>>>> Moses-support mailing list
>>>>>> [email protected]
>>>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>
>>> --
>>> The University of Edinburgh is a charitable body, registered in
>>> Scotland, with registration number SC005336.
