Hi,

On Fri, Nov 18, 2011 at 6:00 PM, Miles Osborne <[email protected]> wrote:
> re: not tuning on training data, in principle this shouldn't matter
> (especially if the tuning set is large and/or representative of the
> task).
>
> in reality, Moses will assign far too much weight to these examples,
> to the detriment of the others (it will drastically overfit). this is
> why the tuning and training sets are typically disjoint. this is a
> standard tactic in NLP and not just Moses.
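In practice the disjointness is easy to enforce when preparing the data. A rough Python sketch (the function name and toy pairs are purely illustrative, not Moses tooling):

```python
import random

def split_corpus(pairs, tune_size, seed=0):
    """Split a parallel corpus into disjoint training and tuning sets,
    deduplicating first so the same pair cannot land in both."""
    rng = random.Random(seed)
    unique = sorted(set(pairs))
    rng.shuffle(unique)
    return unique[tune_size:], unique[:tune_size]

pairs = [("un chat", "a cat"), ("un chien", "a dog"),
         ("un chat", "a cat"), ("une maison", "a house"),
         ("le lait", "the milk")]
train, tune = split_corpus(pairs, tune_size=1)
# The system is tuned on pairs it has never seen during training.
assert not set(train) & set(tune)
```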
OK, thanks. That indeed matches what I learned on this topic years ago
(back when I was still at university and in fact working on this kind
of topic, though that now feels rather far away).

[Also, Tom Hoar, please disregard my questions about your answer at
this point (when I asked "how do you do so?" and the like). I had
misunderstood what you meant! With Miles's answer, and on rereading
yours, I now understand.]

> re: assigning more weight to certain translations, you have two
> options here. the first would be to assign more weight to these pairs
> when you run Giza++ (you can assign per-sentence pair weights at
> this stage). this is really just a hint and won't guarantee anything.
> the second option would be to force translations (using the XML
> markup).

I see, interesting. For what I want, the per-pair weights in GIZA++
look like a good fit. I'll try to find more information on this.

Thanks a lot for the answers.

Jehan

> Miles
>
> On 18 November 2011 08:42, Jehan Pages <[email protected]> wrote:
>> Hi,
>>
>> On Fri, Nov 18, 2011 at 2:59 PM, Tom Hoar
>> <[email protected]> wrote:
>>> Jehan, here are my strategies; others may vary.
>>
>> Thanks.
>>
>>> 1/ The 100-word (token) limit is a dependency of GIZA++ and
>>> MGIZA++, not just a convenience for speed. If you make the effort
>>> to use the BerkeleyAligner, this limit disappears.
>>
>> Ok, I didn't know about this alternative to GIZA++. I see there are
>> some explanations on the website about switching to this aligner. I
>> may give it a try someday then. :-)
>>
>>> 2/ From a statistics and survey-methodology point of view, your
>>> training data is a subset of individual samples selected from a
>>> whole population (linguistic domain) so as to estimate the
>>> characteristics of the whole population. So, duplicates can exist,
>>> and they play an important role in determining statistical
>>> significance and calculating probabilities.
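For the archives, the XML markup mentioned above looks roughly like this on the decoder's input side. A small sketch, with a hypothetical helper; as far as I understand the manual, the `translation` and optional `prob` attributes (together with the `-xml-input` switch) are what Moses actually reads, while the tag name itself is arbitrary:

```python
def mark_forced(token, translation, prob=None):
    """Wrap a source token in Moses-style XML markup suggesting (or,
    with -xml-input exclusive, forcing) a given translation."""
    attrs = 'translation="%s"' % translation
    if prob is not None:
        attrs += ' prob="%s"' % prob
    return '<x %s>%s</x>' % (attrs, token)

# Input line for something like: moses -xml-input exclusive -f moses.ini
line = "das ist %s haus" % mark_forced("kleines", "small")
assert line == 'das ist <x translation="small">kleines</x> haus'
```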
>>> Some data sources, however, repeat information with little
>>> relevance to the linguistic balance of the whole domain. One
>>> example is a web site with repetitive menus on every page.
>>> Therefore, for our use, we keep duplicates where we believe they
>>> represent a balanced sampling and the results we want to achieve.
>>> We remove them when they do not. Not everyone, however, agrees
>>> with this approach.
>>
>> I see, and that confirms my thoughts. I don't know yet for sure what
>> my strategy will be, but most probably it will be to keep them all.
>> Conditional removal like you do is interesting, but it would be hard
>> to do on our platform, as we don't store any context with the
>> translations.
>>
>>> 3/ Yes, none of the data pairs in the tuning set should be present
>>> in your training data. To do so skews the tuning weights to give
>>> excellent BLEU scores on the tuning results, but horrible scores
>>> on "real world" translations.
>>
>> I am not sure I understand what you are saying. How do you do so?
>> Also, why would we want to give horrible scores to real-world
>> translations? Isn't the point exactly that the tuning data should
>> "represent" the real-world translations we want to get close to?
>>
>> 4/ Also, I was wondering about something else that I just
>> remembered, so here is a fourth question!
>> Suppose that in our system we have some translations we know for
>> sure are very good (all are good, but some are supposed to be more
>> like "certified quality"). Is there a way in Moses to give more
>> weight to some translations, in order to steer the system towards
>> quality data (while still keeping all the data)?
>>
>> Thanks again!
>>
>> Jehan
>>
>>> Tom
>>>
>>>
>>> On Fri, 18 Nov 2011 14:31:44 +0900, Jehan Pages <[email protected]> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> I have a few questions about the quality of training and tuning
>>>> data. If anyone has any clarifications, that would be nice!
>>>> :-)
>>>>
>>>> 1/ According to the documentation:
>>>> «
>>>> sentences longer than 100 words (and their corresponding
>>>> translations) have to be eliminated
>>>> (note that a shorter sentence length limit will speed up training)
>>>> »
>>>> So, is this only for the sake of training speed, or can overly
>>>> long sentences end up being a liability for MT quality? In other
>>>> words, when I finally train "for real usage", should I really
>>>> remove long sentences?
>>>>
>>>> 2/ My data is taken from real crowd-sourced translations. As a
>>>> consequence, we end up with some duplicates (same original text
>>>> and same translation). I wonder whether, for training, this
>>>> doesn't matter, or we should remove duplicates, or it is actually
>>>> better to keep them.
>>>>
>>>> I would imagine the latter (keeping duplicates) is best, as this
>>>> is "statistical machine learning" and, after all, these represent
>>>> "real life" duplicates (text we often encounter and apparently
>>>> usually translate the same way), so it would be good to "insist
>>>> on" these translations during training.
>>>> Am I right?
>>>>
>>>> 3/ Do training and tuning data necessarily have to be different?
>>>> I guess that for tuning to be meaningful they should, and various
>>>> examples on the website seem to go that way, but I could not find
>>>> anything clearly stating this.
>>>>
>>>> Thanks.
>>>>
>>>> Jehan
>>>>
>>>> _______________________________________________
>>>> Moses-support mailing list
>>>> [email protected]
>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>
>>>
>>
>
>
> --
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
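Regarding question 1/ above: whatever the motivation, the filtering step itself is cheap. Moses ships a clean-corpus-n.perl script for this; a minimal Python equivalent (names illustrative) would be:

```python
def filter_long_pairs(src_lines, tgt_lines, max_tokens=100, min_tokens=1):
    """Keep only sentence pairs where both sides fall within the token
    limits, mirroring the 100-word cap GIZA++/MGIZA++ training needs."""
    kept = []
    for src, tgt in zip(src_lines, tgt_lines):
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if (min_tokens <= n_src <= max_tokens
                and min_tokens <= n_tgt <= max_tokens):
            kept.append((src, tgt))
    return kept

src = ["a short sentence", "w " * 101]
tgt = ["une phrase courte", "m " * 101]
# Only the pair within the limit survives.
assert filter_long_pairs(src, tgt) == [("a short sentence", "une phrase courte")]
```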
