Thanks, Matthias, for the detailed explanation.
I think I have most of it in mind, except that I don't really understand
how this one works:

"Difficult sentences generally have worse model score than easy ones but
may still be useful for training."


But yes, what you describe is more or less what I did to better
understand the mechanism.
And I know I have to tune with in-domain data for a proper end result.

Cheers,
Vincent

On 24/09/2015 22:13, Matthias Huck wrote:
> Hi Vincent,
>
> This is a different topic, and I'm not completely clear about what
> exactly you did here. Did you decode the source side of the parallel
> training data, conduct sentence selection by applying a threshold on the
> decoder score, and extract a new phrase table from the selected fraction
> of the original parallel training data? If this is the case, I have some
> comments:
>
>
> - Be careful when you translate training data. The system knows these
> sentences and does things like frequently applying long singleton
> phrases that have been extracted from the very same sentence.
> https://aclweb.org/anthology/P/P10/P10-1049.pdf
>
> - Longer sentences may have a worse model score than shorter sentences.
> Consider normalizing by sentence length if you use the model score for
> data selection (see the sketch after this list).
> Difficult sentences generally have a worse model score than easy ones
> but may still be useful for training. You may end up keeping only the
> parts of the data that are easy to translate or are highly redundant in
> the corpus.
>
> - You probably see no out-of-vocabulary words (OOVs) when translating
> training data, or very few of them (depending on word alignment, phrase
> extraction method, and phrase table pruning), but be aware that if there
> are OOVs, this may affect the model score a lot.
>
> - Check to what extent the sentence selection reduces the vocabulary of
> your system (the sketch below reports this as well).
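>
> To illustrate the length normalization and the vocabulary check,
> something along these lines would do (a rough sketch, not a Moses tool;
> it assumes a tab-separated input of sentence and total decoder score,
> and the threshold is a made-up value you would have to tune):
>
> #!/usr/bin/env python
> # Select sentences by length-normalized model score and report how much
> # of the vocabulary the selection keeps. Input: sentence<TAB>score.
> import sys
>
> THRESHOLD = -2.5  # hypothetical per-word score cutoff; tune on your data
>
> kept, vocab_all, vocab_kept = [], set(), set()
> for line in open(sys.argv[1]):
>     sentence, score = line.rstrip("\n").rsplit("\t", 1)
>     words = sentence.split()
>     vocab_all.update(words)
>     # normalize the total model score by the number of words
>     if float(score) / max(len(words), 1) >= THRESHOLD:
>         kept.append(sentence)
>         vocab_kept.update(words)
>
> for sentence in kept:
>     print(sentence)
>
> # report how much of the vocabulary survives the selection
> sys.stderr.write("kept %d sentences, vocabulary %d -> %d\n"
>                  % (len(kept), len(vocab_all), len(vocab_kept)))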
>
>
> Last but not least, two more general comments:
>
> - You need dev and test sets that are similar to the type of real-world
> documents that you're building your system for. Don't tune on Europarl
> if you eventually want to translate pharmaceutical patents, for
> instance. Try to collect in-domain training data as well.
>
> - In case you have in-domain and out-of-domain training corpora, you can
> try modified Moore-Lewis filtering for data selection (a sketch follows
> below).
> https://aclweb.org/anthology/D/D11/D11-1033.pdf
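>
> Roughly, modified Moore-Lewis ranks each sentence by the difference in
> per-word cross-entropy between an in-domain LM and an out-of-domain LM
> and keeps the lowest-scoring part. A sketch, assuming the kenlm Python
> module and two LMs you have trained yourself (the file names are
> placeholders):
>
> import sys
> import kenlm
>
> lm_in = kenlm.Model("in_domain.arpa")     # hypothetical paths
> lm_out = kenlm.Model("out_of_domain.arpa")
>
> scored = []
> for line in open(sys.argv[1]):
>     sentence = line.strip()
>     n = max(len(sentence.split()), 1)
>     # per-word cross-entropy difference; lower means more in-domain-like
>     h_in = -lm_in.score(sentence) / n
>     h_out = -lm_out.score(sentence) / n
>     scored.append((h_in - h_out, sentence))
>
> # keep the best-scoring half; the cutoff is something to tune
> scored.sort(key=lambda pair: pair[0])
> for _, sentence in scored[:len(scored) // 2]:
>     print(sentence)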
>
>
> Cheers,
> Matthias
>
>
> On Thu, 2015-09-24 at 18:19 +0200, Vincent Nguyen wrote:
>> This is an interesting subject...
>>
>> As a matter of fact I have done several tests.
>> I came upon this need after realizing that, even though my results
>> were good in a "standard dev + test set" situation, I got some strange
>> results with real-world documents.
>> That's why I investigated.
>>
>> But you are right: removing some so-called bad entries could have
>> unexpected results.
>>
>> For instance, here is a test I did:
>>
>> I trained a fr-en model on Europarl v7 (2 million sentences).
>> I tuned with a subset of 3K sentences.
>> I ran an evaluation on the full 2 million lines.
>> Then I removed the 90K sentences for which the score was less than 0.2
>> and retrained on the remaining 1,917,853 sentences (the removal step is
>> sketched below).
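>>
>> In script form, the removal step was roughly the following; the file
>> names are placeholders, and it assumes one score per line, aligned
>> with the parallel corpus:
>>
>> THRESHOLD = 0.2
>>
>> with open("scores.txt") as scores, \
>>      open("corpus.fr") as src, open("corpus.en") as tgt, \
>>      open("filtered.fr", "w") as out_src, \
>>      open("filtered.en", "w") as out_tgt:
>>     for score, fr, en in zip(scores, src, tgt):
>>         # keep only sentence pairs scored at or above the threshold
>>         if float(score) >= THRESHOLD:
>>             out_src.write(fr)
>>             out_tgt.write(en)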
>>
>> In the end I got a higher percentage of sentences with a score above
>> 0.2, but at > 0.3 the two corpora become similar, and at > 0.4 the
>> initial corpus is better.
>>
>> Just weird.
>
>
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
