Hi,

On Fri, Nov 18, 2011 at 11:22 PM, Tom Hoar <[email protected]> wrote:
> Jehan,
>
> A brute-force method to give some phrases more weight is to simply create
> intentional duplicates in your training data set. Miles' option has more
> finesse.
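(For reference, the brute-force version is trivial to script. A minimal sketch, assuming one-sentence-per-line parallel files already loaded into lists, and a made-up set of pair indices to boost:)

```python
# Sketch: weight selected sentence pairs by duplicating them N times.
# The file contents and the "boost" index set are made up for illustration.

def duplicate_pairs(src_lines, tgt_lines, boost, factor=3):
    """Return new parallel lists where pairs whose index is in `boost`
    appear `factor` times instead of once (order is preserved)."""
    out_src, out_tgt = [], []
    for i, (s, t) in enumerate(zip(src_lines, tgt_lines)):
        times = factor if i in boost else 1
        out_src.extend([s] * times)
        out_tgt.extend([t] * times)
    return out_src, out_tgt

src = ["a house", "the cat"]
tgt = ["une maison", "le chat"]
# Boost the second pair threefold before writing the files back out.
new_src, new_tgt = duplicate_pairs(src, tgt, boost={1}, factor=3)
```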
Though I would definitely fall back on this if I cannot make weights work in GIZA++, I'd like to try the GIZA++ way first (unless someone tells me it is actually exactly the same thing?).

I have been trying to find the GIZA++ option for this, but documentation is definitely lacking here, and the web is not very talkative on the matter either. A grep on the GIZA++ source code suggests that only plain2snt has an option named "weight" (unless the feature goes by another name in the other tools of the archive). It apparently works like this:

---
plain2snt txt1 txt2 [txt3 txt4 -weight w]
  Converts plain text into GIZA++ snt-format
---

That alone is not enough to be sure how it works (though it gives some hints), and in particular train-model.perl never seems to call this tool (nor is it used internally by GIZA++, according to its source). Since the Moses documentation does not say to link this tool into bin/ (so I didn't), I am quite sure it is never used during the training process.

So is there something I am missing? How does one set weights on training data through GIZA++ (or MGIZA++, whose multi-threading features I am trying right now)?

Thanks.

Jehan

> Tom
>
> On Fri, 18 Nov 2011 18:29:50 +0900, Jehan Pages <[email protected]> wrote:
>>
>> Hi,
>>
>> On Fri, Nov 18, 2011 at 6:00 PM, Miles Osborne <[email protected]> wrote:
>>>
>>> re: not tuning on training data, in principle this shouldn't matter
>>> (especially if the tuning set is large and/or representative of the
>>> task).
>>>
>>> in reality, Moses will assign far too much weight to these examples,
>>> to the detriment of the others (it will drastically overfit). this
>>> is why the tuning and training sets are typically disjoint. this is a
>>> standard tactic in NLP and not just Moses.
>>
>> Ok thanks.
>> Actually, I think that is indeed what I learned years ago on the
>> topic (back when I was still at university, in fact working on this
>> kind of topic, though that is rather far away now).
>>
>> [Also, Tom Hoar: forget my questions about your answer at this point
>> (when I asked "how do you do so?" and such). I misunderstood the
>> meaning of your answer! Now, with Miles's answer, and rereading your
>> first one, I understand.]
>>
>>> re: assigning more weight to certain translations, you have two
>>> options here. the first would be to assign more weight to these pairs
>>> when you run Giza++ (you can assign per-sentence-pair weights at
>>> this stage). this is really just a hint and won't guarantee anything.
>>> the second option would be to force translations (using the XML
>>> markup).
>>
>> I see. Interesting. For what I want, the weights in GIZA++ look nice.
>> I'll try to find information on this.
>>
>> Thanks a lot for the answers.
>>
>> Jehan
>>
>>> Miles
>>>
>>> On 18 November 2011 08:42, Jehan Pages <[email protected]> wrote:
>>>>
>>>> Hi,
>>>>
>>>> On Fri, Nov 18, 2011 at 2:59 PM, Tom Hoar
>>>> <[email protected]> wrote:
>>>>>
>>>>> Jehan, here are my strategies, others may vary.
>>>>
>>>> Thanks.
>>>>
>>>>> 1/ The 100-word (token) limit is a dependency of GIZA++ and MGIZA++,
>>>>> not just a convenience for speed. If you make the effort to use the
>>>>> BerkeleyAligner, this limit disappears.
>>>>
>>>> Ok, I didn't know about this alternative to GIZA++. I see there is
>>>> some explanation on the website for switching to this aligner. I may
>>>> give it a try someday then. :-)
>>>>
>>>>> 2/ From a statistics and survey methodology point of view, your
>>>>> training data is a subset of individual samples selected from a
>>>>> whole population (linguistic domain) so as to estimate the
>>>>> characteristics of the whole population.
>>>>> So, duplicates can exist and they play an important role in
>>>>> determining statistical significance and calculating probabilities.
>>>>> Some data sources, however, repeat information with little relevance
>>>>> to the linguistic balance of the whole domain. One example is a web
>>>>> site with repetitive menus on every page. Therefore, for our use, we
>>>>> keep duplicates where we believe they represent a balanced sampling
>>>>> and the results we want to achieve. We remove them when they do not.
>>>>> Not everyone, however, agrees with this approach.
>>>>
>>>> I see, and that confirms my thoughts. I don't know for sure what my
>>>> strategy will be, but most probably it will be to keep them all.
>>>> Conditional removal like you do is interesting, but it would prove
>>>> hard to do on our platform, as we don't store context for
>>>> translations.
>>>>
>>>>> 3/ Yes, none of the data pairs in the tuning set should be present
>>>>> in your training data. To do so skews the tuning weights to give
>>>>> excellent BLEU scores on the tuning results, but horrible scores on
>>>>> "real world" translations.
>>>>
>>>> I am not sure I understand what you are saying. How would that
>>>> happen? Also, why would we get horrible scores on real-world
>>>> translations? Isn't the point exactly that the tuning data should
>>>> "represent" the real-world translations that we want to get close to?
>>>>
>>>> 4/ Also, I was wondering something else that I just remembered, so
>>>> that will be a fourth question!
>>>> Suppose that in our system we have some translations we know for
>>>> sure are very good (all are good, but some are supposed to be more
>>>> like "certified quality"). Is there no way in Moses to give more
>>>> weight to some translations, in order to bias the system towards
>>>> quality data (while still keeping all the data)?
>>>>
>>>> Thanks again!
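(As an aside, the duplicate removal discussed above is simple enough when context doesn't matter. A rough sketch, assuming aligned one-sentence-per-line files and treating an exact source+target pair as the unit of deduplication; conditional removal as Tom describes would need extra per-source logic on top of this:)

```python
# Sketch: drop exact-duplicate sentence pairs from a parallel corpus,
# keeping the first occurrence. Pairing is by line index.

def dedup_pairs(src_lines, tgt_lines):
    seen = set()
    out = []
    for pair in zip(src_lines, tgt_lines):
        if pair not in seen:
            seen.add(pair)
            out.append(pair)
    return out

# E.g. a crawled site with a repeated "Home" menu entry on every page:
pairs = dedup_pairs(
    ["Home", "About", "Home"],
    ["Accueil", "À propos", "Accueil"],
)
# the repeated ("Home", "Accueil") pair is kept only once
```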
>>>>
>>>> Jehan
>>>>
>>>>> Tom
>>>>>
>>>>>
>>>>> On Fri, 18 Nov 2011 14:31:44 +0900, Jehan Pages <[email protected]>
>>>>> wrote:
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I have a few questions about the quality of training and tuning
>>>>>> data. If anyone has any clarifications, that would be nice! :-)
>>>>>>
>>>>>> 1/ According to the documentation:
>>>>>> «
>>>>>> sentences longer than 100 words (and their corresponding
>>>>>> translations) have to be eliminated
>>>>>> (note that a shorter sentence length limit will speed up training)
>>>>>> »
>>>>>> So is it only for the sake of training speed, or can overly long
>>>>>> sentences end up being a liability for MT quality? In other words,
>>>>>> when I finally train "for real usage", should I really remove long
>>>>>> sentences?
>>>>>>
>>>>>> 2/ My data is taken from real crowd-sourced translated data. As a
>>>>>> consequence, we end up with some duplicates (same original text and
>>>>>> same translation). I wonder whether, for training, that doesn't
>>>>>> matter, or we should remove duplicates, or it is actually better to
>>>>>> have duplicates.
>>>>>>
>>>>>> I would imagine the latter (keeping duplicates) is best, as this is
>>>>>> "statistical machine learning" and, after all, these represent
>>>>>> "real life" duplicates (text we often encounter and apparently
>>>>>> usually translate the same way), so it would be good to "insist on"
>>>>>> these translations during training.
>>>>>> Am I right?
>>>>>>
>>>>>> 3/ Do training and tuning data necessarily have to be different? I
>>>>>> guess for it to be meaningful they should, and various examples on
>>>>>> the website seem to go that way, but I could not find anything
>>>>>> clearly stating this.
>>>>>>
>>>>>> Thanks.
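(On question 1 above: in practice this filtering is what Moses' clean-corpus-n.perl script does. A minimal stand-in, assuming whitespace tokenization; the real script additionally enforces a minimum length and a source/target length ratio:)

```python
# Sketch: keep only sentence pairs where BOTH sides are within a token
# limit. Only the maximum-length check of clean-corpus-n.perl is shown.

def filter_long_pairs(src_lines, tgt_lines, max_tokens=100):
    kept = []
    for s, t in zip(src_lines, tgt_lines):
        if len(s.split()) <= max_tokens and len(t.split()) <= max_tokens:
            kept.append((s, t))
    return kept

kept = filter_long_pairs(
    ["one two three", "a " * 200],  # second source side has 200 tokens
    ["un deux trois", "b"],
    max_tokens=100,
)
# only the short pair survives
```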
>>>>>>
>>>>>> Jehan
>>>>>>
>>>>>> _______________________________________________
>>>>>> Moses-support mailing list
>>>>>> [email protected]
>>>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>
>>> --
>>> The University of Edinburgh is a charitable body, registered in
>>> Scotland, with registration number SC005336.
