Hi,

On Fri, Nov 18, 2011 at 2:59 PM, Tom Hoar
<[email protected]> wrote:
> Jehan, here are my strategies, others may vary.

Thanks.

> 1/ the 100-word (token) limit is a dependency of GIZA++ and MGIZA++, not
> just a convenience for speed. If you make the effort to use the
> BerkeleyAligner, this limit disappears.

Ok, I didn't know about this alternative to GIZA++. I see there are
some explanations on the website for switching to this aligner. I may
give it a try someday then. :-)
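
By the way, on the length limit itself: before I look into
BerkeleyAligner, I will probably just filter out the long pairs as a
preprocessing step. Here is a minimal sketch of what I have in mind, in
Python (the file names and the 100-token limit are my own placeholders;
I believe Moses also ships a clean-corpus-n.perl script for this):

MAX_TOKENS = 100  # the sentence-length limit GIZA++ expects by default

# Drop any sentence pair where either side exceeds the token limit,
# assuming plain one-sentence-per-line parallel files.
with open("corpus.fr") as src, open("corpus.en") as tgt, \
     open("clean.fr", "w") as src_out, open("clean.en", "w") as tgt_out:
    for s, t in zip(src, tgt):
        if len(s.split()) <= MAX_TOKENS and len(t.split()) <= MAX_TOKENS:
            src_out.write(s)
            tgt_out.write(t)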

> 2/ From a statistics and survey methodology point of view, your training
> data is a subset of individual samples selected from a whole population
> (linguistic domain) so as to estimate the characteristics of the whole
> population. So, duplicates can exist and they play an important role in
> determining statistical significance and calculating probabilities. Some
> data sources, however, repeat information with little relevance to the
> linguistic balance of the whole domain. One example is a web site with
> repetitive menus on every page. Therefore, for our use, we keep duplicates
> where we believe they represent a balanced sampling and results we want to
> achieve. We remove them when they do not. Not everyone, however, agrees with
> this approach.

I see. And that confirms my thoughts. I don't know my strategy for
sure yet, but most probably I will end up keeping them all. Conditional
removal like yours is interesting, but it would be hard to do on our
platform, as we don't store any context with the translations.
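
And in case we ever change our minds, removing exact duplicates (same
source and same translation) is simple enough to do as preprocessing. A
minimal sketch in Python, with placeholder file names:

# Keep only the first occurrence of each (source, target) pair.
seen = set()
with open("corpus.fr") as src, open("corpus.en") as tgt, \
     open("dedup.fr", "w") as src_out, open("dedup.en", "w") as tgt_out:
    for pair in zip(src, tgt):
        if pair not in seen:
            seen.add(pair)
            src_out.write(pair[0])  # source line, newline included
            tgt_out.write(pair[1])  # target line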

> 3/ Yes, none of the data pairs in the tuning set should be present in your
> training data. To do so skews the tuning weights to give excellent BLEU
> scores on the tuning results, but horrible scores on "real world"
> translations.

I am not sure I understand. How does that skewing happen? Also, why
would we want to give horrible scores to real-world translations? Isn't
the point exactly that the tuning data should "represent" the
real-world translations that we want to get close to?
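
In any case, I get the practical rule: no tuning pair should also sit
in the training data. To be safe, I could filter the training corpus
against the tuning set before training. A minimal sketch in Python
(file names are placeholders):

# Build the set of tuning pairs, then drop any training pair that
# appears in it, writing the filtered corpus to new files.
with open("tune.fr") as f, open("tune.en") as g:
    tune_pairs = set(zip(f, g))

with open("train.fr") as src, open("train.en") as tgt, \
     open("train.clean.fr", "w") as src_out, \
     open("train.clean.en", "w") as tgt_out:
    for pair in zip(src, tgt):
        if pair not in tune_pairs:
            src_out.write(pair[0])
            tgt_out.write(pair[1])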


4/ Also, I was wondering something else that I just remembered, so
that will be a fourth question!
Suppose that in our system we have some translations we know for sure
are very good (all are good, but some are supposed to be more like
"certified quality"). Is there a way in Moses to give more weight to
some translations, in order to influence the system towards the
quality data (while still keeping all the data)?
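
If there is no built-in way, I imagine I could crudely approximate this
by oversampling: repeating the certified pairs a few times in the
training corpus so they carry more weight in the counts. A minimal
Python sketch (file names and the repeat factor are my own
placeholders; this is just a preprocessing trick, not a Moses feature
as far as I know):

REPEATS = 3  # how many copies each certified pair gets

def copy_pairs(src_path, tgt_path, src_out, tgt_out, copies):
    # Append each pair from the input files 'copies' times.
    with open(src_path) as src, open(tgt_path) as tgt:
        for s, t in zip(src, tgt):
            for _ in range(copies):
                src_out.write(s)
                tgt_out.write(t)

with open("train.fr", "w") as src_out, open("train.en", "w") as tgt_out:
    copy_pairs("normal.fr", "normal.en", src_out, tgt_out, 1)
    copy_pairs("certified.fr", "certified.en", src_out, tgt_out, REPEATS)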

Thanks again!

Jehan

> Tom
>
>
> On Fri, 18 Nov 2011 14:31:44 +0900, Jehan Pages <[email protected]> wrote:
>>
>> Hi all,
>>
>> I have a few questions about quality of training and tuning. If anyone
>> has any clarifications, that would be nice! :-)
>>
>> 1/ According to the documentation:
>> «
>> sentences longer than 100 words (and their corresponding translations)
>> have to be eliminated
>>   (note that a shorter sentence length limit will speed up training
>> »
>> So is it only for the sake of training speed, or can overly long
>> sentences end up being a liability for MT quality? In other words,
>> when I finally need to train "for real usage", should I really remove
>> long sentences?
>>
>> 2/ My data is taken from real crowd-sourced translated data. As a
>> consequence, we end up with some duplicates (same original text and
>> same translation). I wonder whether, for training, that doesn't
>> matter, whether we should remove duplicates, or whether it's actually
>> better to have duplicates.
>>
>> I would imagine the latter (keeping duplicates) is best, as this is
>> "statistical machine learning" and, after all, these represent "real
>> life" duplicates (text we often encounter and apparently usually
>> translate the same way), so it would be good to "insist on" these
>> translations during training.
>> Am I right?
>>
>> 3/ Do training and tuning data necessarily have to be different? I
>> guess for it to be meaningful they should be, and various examples on
>> the website seem to go that way, but I could not find anything clearly
>> stating this.
>>
>> Thanks.
>>
>> Jehan
>>
>
>

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
