Hi,

On Fri, Nov 18, 2011 at 6:00 PM, Miles Osborne <[email protected]> wrote:
> re: not tuning on training data, in principle this shouldn't matter
> (especially if the tuning set is large and/or representative of the
> task).
>
> in reality, Moses will assign far too much weight to these examples,
> to the detriment of the others (it will drastically overfit).  this
> is why the tuning and training sets are typically disjoint.  this is a
> standard tactic in NLP and not just Moses.

Ok, thanks. That does remind me of what I learned on the topic years
ago (back when I was still at university, actually working on this
kind of topic, though that is rather far behind me now).
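
To be safe, I will hold out a disjoint tuning set from the start.
Something like this untested sketch should do (the file names and the
2000-pair tuning size are just made up for the example):

    import random

    # Untested sketch: split a parallel corpus into disjoint
    # training and tuning sets (file names are made up).
    random.seed(0)

    with open("corpus.fr") as fsrc, open("corpus.en") as ftgt:
        pairs = list(zip(fsrc, ftgt))

    random.shuffle(pairs)
    tune, train = pairs[:2000], pairs[2000:]

    for stem, subset in (("tune", tune), ("train", train)):
        with open(stem + ".fr", "w") as fs, open(stem + ".en", "w") as ft:
            for src, tgt in subset:
                fs.write(src)
                ft.write(tgt)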

[Also, Tom Hoar, please disregard my questions about your answer on
this point (when I asked "how do you do so?" and the like). I had
misunderstood what you meant! With Miles's answer, and after rereading
your first one, I now understand.]

> re:  assigning more weight to certain translations, you have two
> options here.  the first would be to assign more weight to these pairs
> when you run Giza++.  (you can assign per-sentence pair weights at
> this stage).  this is really just a hint and won't guarantee anything.
>  the second option would be to force translations (using the XML
> markup).
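
(For the archives: if I read the documentation correctly, the XML
markup option means annotating the decoder input, e.g.

    das ist <np translation="a small house">ein kleines haus</np>

and running moses with -xml-input exclusive (or inclusive). As far as
I can tell the tag name is arbitrary and the translation attribute
carries the forced phrase; please correct me if I got the syntax
wrong.)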

I see. Interesting. For what I want, the per-sentence weights in
GIZA++ look like the right fit. I'll try to find more information on
this.
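
From a first look, the hook seems to be GIZA++'s .snt corpus format:
plain2snt turns the corpus into triples whose first line is the count
(frequency) of the sentence pair, e.g.

    2
    10 24 301
    8 15 92

i.e. two occurrences of a pair, followed by the source and target word
IDs. If I understand correctly, raising that count is how one weights
a pair more heavily, but I still need to double-check this.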

Thanks a lot for the answers.

Jehan

> Miles
>
> On 18 November 2011 08:42, Jehan Pages <[email protected]> wrote:
>> Hi,
>>
>> On Fri, Nov 18, 2011 at 2:59 PM, Tom Hoar
>> <[email protected]> wrote:
>>> Jehan, here are my strategies; others may vary.
>>
>> Thanks.
>>
>>> 1/ the 100-word (token) limit is a dependency of GIZA++ and MGIZA++, not
>>> just a convenience for speed. If you make the effort to use the
>>> BerkeleyAligner, this limit disappears.
>>
>> Ok, I didn't know about this alternative to GIZA++. I see there is
>> some explanation on the website about switching to this aligner. I may
>> give it a try someday then. :-)
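>>
>> (By the way, for enforcing the length limit itself with GIZA++, I see
>> Moses ships scripts/training/clean-corpus-n.perl, used roughly like
>> this, where the "fr"/"en" extensions and the 1-100 length bounds are
>> just an example:
>>
>>     clean-corpus-n.perl corpus fr en corpus.clean 1 100
>>
>> which keeps only the pairs whose lengths fall within the bounds.)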
>>
>>> 2/ From a statistics and survey methodology point of view, your training
>>> data is a subset of individual samples selected from a whole population
>>> (linguistic domain) so-as to estimate the characteristics of the whole
>>> population. So, duplicates can exist and they play an important role in
>>> determining statistical significance and calculating probabilities. Some
>>> data sources, however, repeat information with little relevance to the
>>> linguistic balance of the whole domain. One example is a website with
>>> repetitive menus on every page. Therefore, for our use, we keep duplicates
>>> where we believe they represent a balanced sampling and results we want to
>>> achieve. We remove them when they do not. Not everyone, however, agrees with
>>> this approach.
>>
>> I see, and that confirms my thoughts. I don't know yet for sure what
>> my strategy will be, but most probably it will be to keep them all.
>> Conditional removal like you do is interesting, but it would prove
>> hard to do on our platform, as we don't store any context for the
>> translations.
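>>
>> (If we ever do want to drop exact duplicates, I imagine an untested
>> sketch like this would be enough, with made-up file names:
>>
>>     # Untested sketch: drop exact duplicate sentence pairs,
>>     # keeping the first occurrence.
>>     seen = set()
>>     with open("corpus.fr") as fsrc, open("corpus.en") as ftgt, \
>>          open("dedup.fr", "w") as osrc, open("dedup.en", "w") as otgt:
>>         for pair in zip(fsrc, ftgt):
>>             if pair not in seen:
>>                 seen.add(pair)
>>                 osrc.write(pair[0])
>>                 otgt.write(pair[1])
>>
>> though as I said, we will most likely keep the duplicates.)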
>>
>>> 3/ Yes, none of the data pairs in the tuning set should be present in your
>>> training data. Including them skews the tuning weights to give excellent
>>> BLEU scores on the tuning set, but horrible scores on "real world"
>>> translations.
>>
>> I am not sure I understand what you mean. How do you do so? Also, why
>> would we want horrible scores on real-world translations? Isn't the
>> point exactly that the tuning data should "represent" the real-world
>> translations that we want to get close to?
>>
>>
>> 4/ I was also wondering something else that I just remembered, so
>> that will be a fourth question!
>> Suppose that in our system we have some translations we know for sure
>> are very good (all are good, but some are supposed to be more like
>> "certified quality"). Is there a way in Moses to give more weight to
>> some translations, in order to steer the system towards the quality
>> data (while still keeping all the data)?
>>
>> Thanks again!
>>
>> Jehan
>>
>>> Tom
>>>
>>>
>>> On Fri, 18 Nov 2011 14:31:44 +0900, Jehan Pages <[email protected]> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> I have a few questions about quality of training and tuning. If anyone
>>>> has any clarifications, that would be nice! :-)
>>>>
>>>> 1/ According to the documentation:
>>>> «
>>>> sentences longer than 100 words (and their corresponding translations)
>>>> have to be eliminated
>>>>   (note that a shorter sentence length limit will speed up training
>>>> »
>>>> So is it only for the sake of training speed, or can overly long
>>>> sentences end up being a liability for MT quality? In other words, when
>>>> I finally train "for real usage", should I really remove long sentences?
>>>>
>>>> 2/ My data is taken from real crowd-sourced translated data. As a
>>>> consequence, we end up with some duplicates (same original text and
>>>> same translation). I wonder whether, for training, duplicates don't
>>>> matter, whether we should remove them, or whether it is actually
>>>> better to keep them.
>>>>
>>>> I would imagine the latter (keeping duplicates) is best, as this is
>>>> "statistical machine learning" and, after all, these represent "real
>>>> life" duplicates (text we often encounter and that we apparently
>>>> usually translate the same way), so it would be good to "insist on"
>>>> these translations during training.
>>>> Am I right?
>>>>
>>>> 3/ Do the training and tuning data necessarily have to be different? I
>>>> guess that for tuning to be meaningful they should, and various examples
>>>> on the website seem to point that way, but I could not find anything
>>>> clearly stating this.
>>>>
>>>> Thanks.
>>>>
>>>> Jehan
>>>>

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
