Hi Vincent,

This is a different topic, and I'm not completely clear about what exactly you did here. Did you decode the source side of the parallel training data, select sentences by applying a threshold on the decoder score, and then extract a new phrase table from the selected fraction of the original parallel training data? If so, I have some comments:
- Be careful when you translate training data. The system knows these sentences and does things like frequently applying long singleton phrases that have been extracted from the very same sentence. https://aclweb.org/anthology/P/P10/P10-1049.pdf

- Longer sentences may have a worse model score than shorter sentences. Consider normalizing by sentence length if you use the model score for data selection (a small sketch is included at the end of this message). Difficult sentences generally have a worse model score than easy ones but may still be useful for training. You may end up keeping only the parts of the data that are easy to translate or highly redundant in the corpus.

- You probably see no out-of-vocabulary words (OOVs) when translating training data, or very few of them (depending on word alignment, phrase extraction method, and phrase table pruning), but be aware that if there are OOVs, this may affect the model score a lot.

- Check to what extent the sentence selection reduces the vocabulary of your system (a second sketch at the end of this message shows one way to do this).

Last but not least, two more general comments:

- You need dev and test sets that are similar to the type of real-world documents that you're building your system for. Don't tune on Europarl if you eventually want to translate pharmaceutical patents, for instance. Try to collect in-domain training data as well.

- If you have in-domain and out-of-domain training corpora, you can try modified Moore-Lewis filtering for data selection (a third sketch at the end of this message illustrates the idea). https://aclweb.org/anthology/D/D11/D11-1033.pdf

Cheers,
Matthias

On Thu, 2015-09-24 at 18:19 +0200, Vincent Nguyen wrote:
> This is an interesting subject...
>
> As a matter of fact, I have done several tests.
> I came to this need after realizing that even though my results were
> good in a "standard dev + test set" situation,
> I had some strange results with real-world documents.
> That's why I investigated.
>
> But you are right, removing some so-called bad entries could have
> unexpected results.
>
> For instance, here is a test I did:
>
> I trained a fr-en model on Europarl v7 (2 million sentences).
> I tuned with a subset of 3K sentences.
> I ran an evaluation on the full 2 million lines.
> Then I removed the 90K sentences for which the score was less than 0.2
> and retrained on 1,917,853 sentences.
>
> In the end I got more sentences (in %) with a score above 0.2,
> but when analyzing at > 0.3 it becomes similar, and at > 0.4 the initial
> corpus is better.
>
> Just weird.
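
Here is the small length-normalization sketch mentioned above. It is only an illustration, not part of Moses: it assumes you already have one decoder score per training sentence in a separate, line-aligned file, and all file names and the threshold value are placeholders.

#!/usr/bin/env python3
# Keep only sentence pairs whose per-word decoder score clears a threshold.
# Assumes corpus.fr / corpus.en / scores.txt are line-aligned; all file
# names and the threshold are placeholders.

THRESHOLD = -0.2  # per-word log score; choose a value on held-out data

with open("corpus.fr") as src, open("corpus.en") as tgt, \
     open("scores.txt") as scores, \
     open("filtered.fr", "w") as out_src, open("filtered.en", "w") as out_tgt:
    for s, t, raw in zip(src, tgt, scores):
        length = max(len(s.split()), 1)       # source length in tokens
        if float(raw) / length >= THRESHOLD:  # length-normalized score
            out_src.write(s)
            out_tgt.write(t)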
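
The second sketch checks how much the sentence selection shrinks the source-side vocabulary; again the file names are placeholders.

#!/usr/bin/env python3
# Compare source-side vocabulary before and after sentence selection.

def vocab(path):
    words = set()
    with open(path) as f:
        for line in f:
            words.update(line.split())
    return words

full = vocab("corpus.fr")      # original training data (placeholder name)
kept = vocab("filtered.fr")    # selected subset (placeholder name)
lost = full - kept
print(f"full vocab: {len(full)}, after selection: {len(kept)}, "
      f"word types lost: {len(lost)} ({100.0 * len(lost) / len(full):.1f}%)")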
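
And the third sketch shows the ranking step of modified Moore-Lewis filtering. It assumes the kenlm Python module and four pre-trained language models (in-domain and out-of-domain, source and target side); all file names and the cutoff are placeholders. Sentence pairs with the lowest cross-entropy difference are the most in-domain-like and would be kept.

#!/usr/bin/env python3
# Rank out-of-domain sentence pairs by the modified Moore-Lewis criterion
# (sum of the source- and target-side cross-entropy differences).
import kenlm

lm_in_src = kenlm.Model("in_domain.fr.arpa")
lm_out_src = kenlm.Model("out_domain.fr.arpa")
lm_in_tgt = kenlm.Model("in_domain.en.arpa")
lm_out_tgt = kenlm.Model("out_domain.en.arpa")

def xent(lm, sentence):
    """Per-word cross-entropy (negated average log10 probability)."""
    n = max(len(sentence.split()), 1)
    return -lm.score(sentence, bos=True, eos=True) / n

scored = []
with open("out_domain.fr") as src, open("out_domain.en") as tgt:
    for s, t in zip(src, tgt):
        s, t = s.strip(), t.strip()
        diff = (xent(lm_in_src, s) - xent(lm_out_src, s)) + \
               (xent(lm_in_tgt, t) - xent(lm_out_tgt, t))
        scored.append((diff, s, t))

scored.sort(key=lambda x: x[0])      # lowest difference = most in-domain-like
for diff, s, t in scored[:100000]:   # keep the top-N pairs (N is a placeholder)
    print(f"{s}\t{t}")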
