Thanks, Matthias, for the detailed explanation. I think I have most of it in mind, except that I don't really understand how this one works:
"Difficult sentences generally have worse model score than easy ones but may still be useful for training." but yes what you describe is more or less what I did to better understand the mechanism. and I know I have to tune with in domain data for proper end result. Cheers, Vincent Le 24/09/2015 22:13, Matthias Huck a écrit : > Hi Vincent, > > This is a different topic, and I'm not completely clear about what > exactly you did here. Did you decode the source side of the parallel > training data, conduct sentence selection by applying a threshold on the > decoder score, and extract a new phrase table from the selected fraction > of the original parallel training data? If this is the case, I have some > comments: > > > - Be careful when you translate training data. The system knows these > sentences and does things like frequently applying long singleton > phrases that have been extracted from the very same sentence. > https://aclweb.org/anthology/P/P10/P10-1049.pdf > > - Longer sentences may have worse model score than shorter sentences. > Consider normalizing by sentence length if you use model score for data > selection. > Difficult sentences generally have worse model score than easy ones but > may still be useful for training. You possibly keep the parts of the > data that are easy to translate or are highly redundant in the corpus. > > - You probably see no out-of-vocabulary words (OOVs) when translating > training data, or very few of them (depending on word alignment, phrase > extraction method, and phrase table pruning), but be aware that if there > are OOVs, this may affect the model score a lot. > > - Check to what extent the sentence selection reduces the vocabulary of > your system. > > > Last but not least, two more general comments: > > - You need dev and test sets that are similar to the type of real-world > documents that you're building your system for. Don't tune on Europarl > if you eventually want to translate pharmaceutical patents, for > instance. Try to collect in-domain training data as well. > > - In case you have in-domain and out-of-domain training corpora, you can > try modified Moore-Lewis filtering for data selection. > https://aclweb.org/anthology/D/D11/D11-1033.pdf > > > Cheers, > Matthias > > > On Thu, 2015-09-24 at 18:19 +0200, Vincent Nguyen wrote: >> This is an interesting subject ...... >> >> As a matter of fact I have done several tests. >> I came up to that need after realizing that even though my results were >> good in a "standard dev + test set" situation >> I had some strange results with real-world documents. >> That's why I investigated. >> >> But you are right removing some so-called bad entries could have >> unexpected results. >> >> For instance here is a test I did : >> >> I trained a fr-en model on europarl v7 ( 2 millions sentences) >> I tuned with a subset of 3 K sentences. >> I ran a evaluation on the full 2 million lines. >> then I removed the 90 K sentences for which the score was less than 0.2 >> retrained on 1917853 sentences. >> >> In the end I got more sentences (in %) with a score above 0.2 >> but when analyzing at > 0.3 it becomes similar and > 0.4 the initial >> corpus is better. >> >> Just weird. > > _______________________________________________ Moses-support mailing list [email protected] http://mailman.mit.edu/mailman/listinfo/moses-support

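And on the modified Moore-Lewis pointer, a minimal monolingual sketch of the
cross-entropy difference scoring might look like the following, assuming the
kenlm Python module and two language models (one trained on in-domain text,
one on a random sample of the large out-of-domain corpus). The "modified"
bilingual variant from the cited paper sums this difference over both
language sides; model and corpus paths here are placeholders.

#!/usr/bin/env python3
# Sketch of Moore-Lewis cross-entropy difference scoring with KenLM.
# in_domain.arpa, out_of_domain.arpa and train.fr are placeholder paths.
import kenlm

in_domain = kenlm.Model("in_domain.arpa")       # LM on in-domain text
out_domain = kenlm.Model("out_of_domain.arpa")  # LM on a sample of the big corpus

def per_word_logprob(model, sentence):
    # kenlm returns the total log10 probability of the sentence
    return model.score(sentence, bos=True, eos=True) / max(len(sentence.split()), 1)

def moore_lewis(sentence):
    # lower is better: likely under the in-domain LM, unlikely under the
    # out-of-domain LM (difference of per-word cross-entropies up to a
    # constant factor)
    return per_word_logprob(out_domain, sentence) - per_word_logprob(in_domain, sentence)

with open("train.fr", encoding="utf-8") as f:
    ranked = sorted((moore_lewis(line.strip()), i) for i, line in enumerate(f))

# keep the best-scoring fraction; the cut-off (here 20%) is a tuning choice
keep = set(i for _, i in ranked[: len(ranked) // 5])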