Hi Vincent,

This is a different topic, and I'm not completely clear about what
exactly you did here. Did you decode the source side of the parallel
training data, conduct sentence selection by applying a threshold on the
decoder score, and extract a new phrase table from the selected fraction
of the original parallel training data? If this is the case, I have some
comments:


- Be careful when you translate training data. The system has already
seen these sentences and will, for example, frequently apply long
singleton phrases that were extracted from the very same sentence pair.
https://aclweb.org/anthology/P/P10/P10-1049.pdf

- Longer sentences tend to get worse model scores than shorter ones.
Consider normalizing by sentence length if you use the model score for
data selection (see the sketch after this list).
Difficult sentences also generally score worse than easy ones but may
still be useful for training; you may end up keeping only the parts of
the data that are easy to translate or highly redundant in the corpus.

- You probably see no out-of-vocabulary words (OOVs) when translating
training data, or very few of them (depending on word alignment, phrase
extraction method, and phrase table pruning), but be aware that if there
are OOVs, this may affect the model score a lot.

- Check to what extent the sentence selection reduces the vocabulary of
your system (the sketch below also reports this).
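
To make the length normalization and the vocabulary check concrete, here
is a rough Python sketch. The file names and the threshold are made up,
and the right cutoff depends entirely on how your decoder scores are
scaled, so treat it as an illustration rather than a recipe:

    # Hypothetical sketch: keep sentence pairs whose length-normalized
    # decoder score clears a threshold, and report how much source-side
    # vocabulary the selection loses. File names and THRESHOLD are
    # placeholders.
    THRESHOLD = -2.5  # per-word score cutoff; depends on your score's scale

    full_vocab, kept_vocab = set(), set()
    kept = 0

    with open("train.fr") as src, open("train.en") as tgt, \
         open("decoder.scores") as scores, \
         open("selected.fr", "w") as out_src, \
         open("selected.en", "w") as out_tgt:
        for f, e, s in zip(src, tgt, scores):
            tokens = f.split()
            full_vocab.update(tokens)
            # Normalize by source length so long sentences are not
            # penalized just for being long.
            if float(s) / max(len(tokens), 1) >= THRESHOLD:
                out_src.write(f)
                out_tgt.write(e)
                kept_vocab.update(tokens)
                kept += 1

    lost = len(full_vocab) - len(kept_vocab)
    print("kept %d sentence pairs" % kept)
    print("source vocabulary: %d -> %d types (%d lost)"
          % (len(full_vocab), len(kept_vocab), lost))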


Last but not least, two more general comments:

- You need dev and test sets that are similar to the type of real-world
documents that you're building your system for. Don't tune on Europarl
if you eventually want to translate pharmaceutical patents, for
instance. Try to collect in-domain training data as well.

- If you have in-domain and out-of-domain training corpora, you can try
modified Moore-Lewis filtering for data selection (rough sketch below).
https://aclweb.org/anthology/D/D11/D11-1033.pdf
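
A rough monolingual sketch of the cross-entropy difference idea, assuming
KenLM's Python module and made-up file names (the "modified" variant in
the paper applies the same criterion to both language sides and sums the
two differences):

    import kenlm

    # Hypothetical file names; both LMs must use the same tokenization
    # and casing as the corpus being scored.
    lm_in = kenlm.Model("indomain.arpa")    # LM trained on in-domain data
    lm_out = kenlm.Model("outdomain.arpa")  # LM trained on general data

    def cross_entropy(model, sentence):
        # Per-word negative log10 probability; the log base and the exact
        # normalization do not matter as long as both LMs are treated the
        # same way, since we only rank by the difference.
        words = sentence.split()
        return -model.score(sentence, bos=True, eos=True) / max(len(words), 1)

    scored = []
    with open("general.en") as f:
        for line in f:
            line = line.strip()
            # Moore-Lewis: a lower H_in(s) - H_out(s) means the sentence
            # looks more in-domain relative to the general corpus.
            scored.append((cross_entropy(lm_in, line)
                           - cross_entropy(lm_out, line), line))

    scored.sort()
    keep = int(0.2 * len(scored))  # keep the best 20%; tune this on dev data
    with open("selected.en", "w") as out:
        for _, line in scored[:keep]:
            out.write(line + "\n")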


Cheers,
Matthias


On Thu, 2015-09-24 at 18:19 +0200, Vincent Nguyen wrote:
> This is an interesting subject ......
> 
> As a matter of fact I have done several tests.
> I came to that need after realizing that even though my results were
> good in a "standard dev + test set" situation,
> I had some strange results with real-world documents.
> That's why I investigated.
> 
> But you are right removing some so-called bad entries could have 
> unexpected results.
> 
> For instance here is a test I did :
> 
> I trained a fr-en model on Europarl v7 (2 million sentences).
> I tuned with a subset of 3K sentences.
> I ran an evaluation on the full 2 million lines.
> Then I removed the 90K sentences for which the score was less than 0.2
> and retrained on 1,917,853 sentences.
> 
> In the end I got a higher percentage of sentences with a score above 0.2,
> but at > 0.3 the two become similar, and at > 0.4 the initial
> corpus is better.
> 
> Just weird.



