This is an interesting subject. As a matter of fact, I have run several tests. The need arose when I realized that, although my results were good in a "standard dev + test set" setting, I got some strange results with real-world documents. That's why I investigated.
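A minimal sketch of the comparison described just below, assuming the per-sentence scores of each run are available as plain lists of floats. The function name `share_above` and the toy numbers are hypothetical, chosen only for illustration; they are not my actual results and nothing here is part of Moses.

```python
from typing import Dict, Iterable, Sequence


def share_above(scores: Sequence[float], thresholds: Iterable[float]) -> Dict[float, float]:
    """For each threshold, return the percentage of scores strictly above it."""
    total = len(scores)
    if total == 0:
        raise ValueError("no scores given")
    return {t: 100.0 * sum(s > t for s in scores) / total for t in thresholds}


if __name__ == "__main__":
    # Toy per-sentence scores (made up), one list per training run.
    initial_run = [0.10, 0.25, 0.35, 0.45, 0.50]
    filtered_run = [0.22, 0.25, 0.28, 0.35, 0.41]
    for name, scores in (("initial", initial_run), ("filtered", filtered_run)):
        shares = share_above(scores, (0.2, 0.3, 0.4))
        print(name, {t: round(pct, 1) for t, pct in shares.items()})
```

Looking at several thresholds side by side is the point: a run can look better at one cut-off and worse at another.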
But you are right: removing some so-called bad entries can have unexpected results. For instance, here is a test I did. I trained a fr-en model on Europarl v7 (2 million sentences) and tuned on a subset of 3K sentences. I ran an evaluation on the full 2 million lines, then removed the 90K sentences whose score was below 0.2 and retrained on the remaining 1,917,853 sentences. In the end I got a higher percentage of sentences with a score above 0.2, but above 0.3 the two become similar, and above 0.4 the initial corpus is better. Just weird.

On 24/09/2015 16:42, Matthias Huck wrote:
> Hi,
>
> If your analysis revealed that there's an issue with only a few specific
> entries, then write regular expressions and grep them out. However, you
> risk that those entries are a problem on the devtest set you're looking
> at only, whereas on different input data it'll be other bad translation
> options which pop up.
>
> On Thu, 2015-09-24 at 16:08 +0200, Vincent Nguyen wrote:
>> Matthias,
>>
>> Pruning :
>> I use the cube pop limit at 400 instead of default values (1000 or 5000)
>> I use the MinScore 0.001
> It seems to me that something like MinScore 2:0.001 should be effective
> for most of the bad phrases you copied into your original mail as an
> example.
>
>> I tried sigtest filtering once, it never worked.
> Why not?
>
>> table-limit=20
>> I have the feeling this is only for CreateOnDiskPt
>> am I wrong ?
>> does it work with ProcessPhrasetableMin ?
> I think it works. The decoder does this, not the phrase table binarizer.
> You could run a simple experiment in order to verify. Add
> -feature-overwrite 'TranslationModel0 table-limit=20' (or equivalent) to
> your decoder call.
>
> Cheers,
> Matthias
>
>> On 24/09/2015 15:21, Matthias Huck wrote:
>>> Hi Vincent,
>>>
>>> Pruning the phrase table will discard many bad entries.
>>>
>>> The decoder is typically configured to load no more than a maximum
>>> number of translation options per distinct source side.
>>> Use table-limit=20 as a parameter to your translation model feature to limit
>>> the amount of candidates to the top 20.
>>>
>>> Alternatively you can pre-prune the phrase table. The following page
>>> provides instructions:
>>> http://www.statmt.org/moses/?n=Advanced.RuleTables
>>>
>>> In case you want to remove just a handful of individual entries, I
>>> recommend grep -v on the Linux command line.
>>>
>>> Cheers,
>>> Matthias
>>>
>>> On Thu, 2015-09-24 at 11:05 +0100, Hieu Hoang wrote:
>>>> i've just added a new feature function that allows you to give a list
>>>> of rules that you don't want to be used:
>>>> " 1 ||| One Million Roofs
>>>>
>>>> oui ||| no
>>>>
>>>> To use this list, add the following to your moses.ini file
>>>>
>>>> [feature]
>>>> DeleteRules path=/path/to/list
>>>>
>>>> Not tested.
>>>>
>>>> Hieu Hoang
>>>> http://www.hoang.co.uk/hieu
>>>>
>>>> On 24 September 2015 at 10:11, Vincent Nguyen <[email protected]> wrote:
>>>>
>>>> well at times it does, the sequence:
>>>> " 1 "
>>>> became
>>>> One Million Roofs
>>>> completely off ....
>>>>
>>>> " 1 " . ||| one . ||| 4.77044e-05 2.56689e-08 0.103519 0.0135382 ||| 1-0 3-1 ||| 2170 1 1 ||| |||
>>>> " 1 " une ||| " 1 " meaning ||| 0.0517593 0.00140486 0.103519 5.98457e-06 ||| 0-0 1-1 0-2 2-2 2-3 ||| 2 1 1 ||| |||
>>>> " 1 " ||| " 1 " meaning ||| 0.0517593 0.121628 0.0517593 5.98457e-06 ||| 0-0 1-1 0-2 2-2 2-3 ||| 2 2 1 ||| |||
>>>> " 1 " ||| one ||| 1.34779e-06 2.65512e-08 0.0517593 0.0141179 ||| 1-0 ||| 76806 2 1 ||| |||
>>>> " 1 + ||| ' one @-@ on ||| 0.0517593 8.76241e-09 0.0345062 2.43009e-07 ||| 0-0 2-0 1-1 ||| 2 3 1 ||| |||
>>>> " 1 + ||| ' one @-@ ||| 0.0129398 8.76241e-09 0.0345062 1.65217e-05 ||| 0-0 2-0 1-1 ||| 8 3 1 ||| |||
>>>> " 1 + ||| ' one ||| 0.000685554 8.76241e-09 0.0345062 0.00189493 ||| 0-0 2-0 1-1 ||| 151 3 1 ||| |||
>>>> " 1 . ||| '1 .
||| 0.103519 0.241693 0.0345062 5.37965e-05 ||| 0-0 1-0 2-1 ||| 1 3 1 ||| |||
>>>> " 1 . ||| " 1 . ||| 0.508332 0.34958 0.338888 0.180103 ||| 0-0 1-1 2-2 ||| 2 3 2 ||| |||
>>>> " 1 billion de dollars ||| $ 1 trillion of ||| 0.0207037 2.46862e-05 0.103519 0.0679424 ||| 4-0 1-1 2-2 3-3 ||| 5 1 1 ||| |||
>>>> " 1 billion de ||| 1 trillion of ||| 0.0345062 5.93019e-05 0.103519 0.161697 ||| 1-0 2-1 3-2 ||| 3 1 1 ||| |||
>>>> " 1 billion ||| 1 trillion ||| 0.00108967 0.000131965 0.103519 0.536768 ||| 1-0 2-1 ||| 95 1 1 ||| |||
>>>> " 1 milliard $ , ||| $ 1 billion ||| 0.00199074 2.23776e-06 0.103519 0.420148 ||| 3-0 1-1 2-2 ||| 52 1 1 ||| |||
>>>> " 1 milliard $ ||| $ 1 billion ||| 0.00199074 3.32223e-05 0.103519 0.420148 ||| 3-0 1-1 2-2 ||| 52 1 1 ||| |||
>>>> " 1 milliard d' euros ||| EUR 1 billion ||| 0.00026749 3.23583e-05 0.103519 0.179568 ||| 4-0 1-1 2-2 3-2 ||| 387 1 1 ||| |||
>>>> " 1 milliard d' ||| 1 billion ||| 0.000137475 6.11551e-05 0.103519 0.25129 ||| 1-0 2-1 3-1 ||| 753 1 1 ||| |||
>>>> " 1 milliard de dollars ||| $ 1 billion ||| 0.0195512 2.47433e-05 0.508332 0.105231 ||| 0-0 4-0 1-1 2-2 ||| 52 2 2 ||| |||
>>>> " 1 milliard de personnes ||| one billion people ||| 0.00252484 9.77577e-09 0.103519 0.00258395 ||| 2-0 1-1 2-1 4-2 ||| 41 1 1 ||| |||
>>>> " 1 milliard de ||| 1 billion of ||| 0.00941078 0.000159942 0.0517593 0.15086 ||| 1-0 2-1 3-2 ||| 11 2 1 ||| |||
>>>> " 1 milliard de ||| one billion ||| 0.000509944 4.32371e-08 0.0517593 0.00492989 ||| 2-0 1-1 2-1 ||| 203 2 1 ||| |||
>>>> " 1 milliard ||| 1 billion ||| 0.0026678 0.000355919 0.502213 0.500792 ||| 1-0 2-1 ||| 753 4 3 ||| |||
>>>> " 1 milliard ||| one billion ||| 0.000509944 3.43309e-07 0.0258796 0.00492989 ||| 2-0 1-1 2-1 ||| 203 4 1 ||| |||
>>>> " 1 million $ ||| $ 1 million ||| 0.0172531 1.31973e-05 0.103519 0.221619 ||| 0-0 3-0 1-1 2-2 ||| 6 1 1 ||| |||
>>>> " 1
million de toits ||| one million solar roofs ||| 0.0517593 5.86831e-10 0.103519 1.43348e-10 ||| 2-0 1-1 4-3 ||| 2 1 1 ||| |||
>>>> " 1 million de ||| one million solar ||| 0.0258796 9.85876e-10 0.0517593 3.44036e-10 ||| 2-0 1-1 ||| 4 2 1 ||| |||
>>>> " 1 million de ||| one million ||| 0.00021344 9.85876e-10 0.0517593 0.000202374 ||| 2-0 1-1 ||| 485 2 1 ||| |||
>>>> " 1 million ||| one million solar ||| 0.0258796 7.82802e-09 0.0517593 3.44036e-10 ||| 2-0 1-1 ||| 4 2 1 ||| |||
>>>> " 1 million ||| one million ||| 0.00021344 7.82802e-09 0.0517593 0.000202374 ||| 2-0 1-1 ||| 485 2 1 ||| |||
>>>> " 1 ou 2 % ||| one or two percent ||| 0.0258796 6.85867e-09 0.103519 1.36871e-06 ||| 1-0 2-1 3-2 4-3 ||| 4 1 1 ||| |||
>>>> " 1 ou 2 ||| one or two ||| 0.000164315 2.30435e-08 0.103519 0.00032742 ||| 1-0 2-1 3-2 ||| 630 1 1 ||| |||
>>>> " 1 ou ||| one or ||| 8.83264e-05 3.76903e-06 0.103519 0.0112293 ||| 1-0 2-1 ||| 1172 1 1 ||| |||
>>>> " 1 seul coup , ||| ' 1 shot , ||| 0.103519 1.88862e-06 0.103519 0.00165224 ||| 0-0 1-1 3-2 4-3 ||| 1 1 1 ||| |||
>>>> " 1 seul coup ||| ' 1 shot ||| 0.103519 2.45247e-06 0.103519 0.00222575 ||| 0-0 1-1 3-2 ||| 1 1 1 ||| |||
>>>> " 1 seul ||| ' 1 ||| 0.0129398 2.78897e-05 0.103519 0.214656 ||| 0-0 1-1 ||| 8 1 1 ||| |||
>>>> " 1 ||| ' 1 ||| 0.127083 0.278063 0.0391025 0.214656 ||| 0-0 1-1 ||| 8 26 2 ||| |||
>>>> " 1 ||| '1 ||| 0.103519 0.25 0.00398148 5.61e-05 ||| 0-0 1-0 ||| 1 26 1 ||| |||
>>>> " 1 ||| " 1 ||| 0.503492 0.361595 0.11619 0.187815 ||| 0-0 1-1 ||| 6 26 4 ||| |||
>>>> " 1 ||| 1 ||| 0.0010136 0.00278649 0.461538 0.805151 ||| 1-0 ||| 11839 26 12 ||| |||
>>>> " 1 ||| One Million Roofs ||| 0.103519 0.00213892 0.00398148 3.32314e-15 ||| 0-0 1-0 0-1 0-2 ||| 1 26 1 ||| |||
>>>> " 1 ||| hardly 1 ||| 0.0258796 0.00278649 0.00398148 1.73108e-05 ||| 1-1 ||| 4 26 1 ||| |||
>>>> " 1 ||| million solar ||| 0.0345062 3.55949e-06
0.00398148 3.29783e-09 ||| 1-0 ||| 3 26 1 ||| |||
>>>> " 1 ||| million ||| 5.83433e-06 3.55949e-06 0.00398148 0.0019399 ||| 1-0 ||| 17743 26 1 ||| |||
>>>> " 1 ||| of 1 ||| 0.000263406 0.00278649 0.00398148 0.0270917 ||| 1-1 ||| 393 26 1 ||| |||
>>>> " 1 ||| one ||| 1.32368e-05 5.22671e-06 0.0391025 0.0141179 ||| 1-0 ||| 76806 26 2 ||| |||
>>>> " 1,1 % ||| 1.1 % ||| 0.0022504 0.00241746 0.103519 0.875731 ||| 1-0 2-1 ||| 46 1 1 ||| |||
>>>> " 1,1 milliard d' euros ||| EUR 1.1 billion ||| 0.00544835 6.98053e-05 0.0517593 0.110019 ||| 3-0 4-0 1-1 2-1 2-2 ||| 19 2 1 ||| |||
>>>> " 1,1 milliard d' euros ||| by EUR 1.1 billion ||| 0.0345062 6.98053e-05 0.0517593 0.000791519 ||| 3-1 4-1 1-2 2-2 2-3 ||| 3 2 1 ||| |||
>>>>
>>>> On 24/09/2015 09:54, Felipe Sánchez Martínez wrote:
>>>>
>>>> > Hi,
>>>> >
>>>> > This is quite common. If you look at the scores, they are
>>>> > pretty low when they do not make sense, so, even though they
>>>> > are in the phrase table, most probably they will never be
>>>> > used for translation. I would not bother.
>>>> >
>>>> > Cheers
>>>> > --
>>>> > Felipe
>>>> >
>>>> > On 23/09/15 at 16:50, Vincent Nguyen wrote:
>>>> > > I agree and would like to.
>>>> > > But this is tricky, look at the first 30 lines of my
>>>> > > phrase table below.
>>>> > >
>>>> > > and this happens a lot in the first line of tables where
>>>> > > there are &apos or weird codes, EN/FR pairs do not match.
>>>> > >
>>>> > > ! ! ! ! ||| ! ! ! ! ||| 0.103413 0.132185 0.103413 0.401758 ||| 0-0 1-1 2-2 3-3 ||| 1 1 1 ||| |||
>>>> > > ! ! ! ) ||| ! ! ! ) ||| 0.339323 0.167884 0.508985 0.4246 ||| 0-0 1-0 2-0 2-1 2-2 3-3 ||| 3 2 2 ||| |||
>>>> > > ! ! ! ||| ! ! ! ||| 0.501834 0.219223 0.716905 0.50463 ||| 0-0 1-1 2-2 ||| 10 7 6 ||| |||
>>>> > > ! ! ! ||| budget ! ! !
||| 0.0517067 0.219223 0.0147733 4.50635e-05 ||| 0-1 1-2 2-3 ||| 2 7 1 ||| |||
>>>> > > ! ! ) , ||| ! ! ) - , ||| 0.103413 0.111989 0.103413 0.00192967 ||| 0-0 1-1 2-2 3-3 3-4 ||| 1 1 1 ||| |||
>>>> > > ! ! ) ||| ! ! ) ||| 0.103413 0.278429 0.103413 0.533321 ||| 0-0 1-1 2-2 ||| 1 1 1 ||| |||
>>>> > > ! ! ||| ! ! ||| 0.625 0.363573 0.769231 0.633844 ||| 0-0 1-1 ||| 16 13 10 ||| |||
>>>> > > ! ! ||| . ||| 4.65922e-08 6.71089e-07 0.00795487 0.140779 ||| 0-0 1-0 ||| 2.21954e+06 13 1 ||| |||
>>>> > > ! ! ||| budget ! ! ||| 0.0517067 0.363573 0.00795487 5.66022e-05 ||| 0-1 1-2 ||| 2 13 1 ||| |||
>>>> > > ! ! ||| nécessaire ! ! ||| 0.103413 0.363573 0.00795487 0.000130572 ||| 0-1 1-2 ||| 1 13 1 ||| |||
>>>> > > ! [ never again ! ||| ! ||| 6.51628e-06 5.42074e-13 0.103413 0.796143 ||| 0-0 4-0 ||| 15870 1 1 ||| |||
>>>> > > ! ] this is ||| tel est ||| 7.38667e-05 9.16191e-11 0.103413 0.00147917 ||| 2-0 3-1 ||| 1400 1 1 ||| |||
>>>> > > ! ] this ||| tel ||| 1.09594e-05 1.44188e-10 0.103413 0.0035893 ||| 2-0 ||| 9436 1 1 ||| |||
>>>> > > ! ] ||| ! ] ||| 0.103413 0.352335 0.103413 0.472387 ||| 0-0 1-1 ||| 1 1 1 ||| |||
>>>> > > ! & quot ; ||| ! " . et ||| 0.0517067 2.36396e-12 0.0517067 1.88268e-05 ||| 0-0 1-1 2-1 3-3 ||| 2 2 1 ||| |||
>>>> > > ! & quot ; ||| ! " ||| 0.000222394 1.44515e-11 0.0517067 0.518419 ||| 0-0 2-1 ||| 465 2 1 ||| |||
>>>> > > ! & quot ||| ! " . ||| 0.000662906 8.30626e-09 0.0344711 0.00232791 ||| 0-0 1-1 2-1 ||| 156 3 1 ||| |||
>>>> > > ! & quot ||| ! " ||| 0.00218918 8.30626e-09 0.339323 0.518419 ||| 0-0 2-1 ||| 465 3 2 ||| |||
>>>> > > ! & ||| ! ||| 6.51628e-06 7.21755e-05 0.103413 0.796143 ||| 0-0 ||| 15870 1 1 ||| |||
>>>> > > ! ' ] , addressed ||| !
" adressé ||| 0.103413 3.70838e-07 0.103413 0.00596848 ||| 0-0 1-1 2-1 4-2 ||| 1 1 1 ||| |||
>>>> > > ! ' ] , ||| ! " ||| 0.000222394 2.49698e-06 0.103413 0.215573 ||| 0-0 1-1 2-1 ||| 465 1 1 ||| |||
>>>> > > ! ' ] ||| ! " ||| 0.000222394 3.57128e-05 0.103413 0.215573 ||| 0-0 1-1 2-1 ||| 465 1 1 ||| |||
>>>> > > ! ' ' Alstom shares ||| l' on constate un dysfonctionnement ||| 0.0344711 5.62605e-16 0.103413 1.03361e-14 ||| 1-0 2-0 1-1 3-4 4-4 ||| 3 1 1 ||| |||
>>>> > > ! ' ' ||| l' on constate un ||| 0.0147733 1.56906e-11 0.0129267 2.2766e-12 ||| 1-0 2-0 1-1 ||| 7 8 1 ||| |||
>>>> > > ! ' ' ||| l' on constate ||| 0.000984889 1.56906e-11 0.0129267 2.36929e-10 ||| 1-0 2-0 1-1 ||| 105 8 1 ||| |||
>>>> > > ! ' ' ||| l' on ||| 6.76656e-06 1.56906e-11 0.0129267 6.18613e-06 ||| 1-0 2-0 1-1 ||| 15283 8 1 ||| |||
>>>> > > ! ' ' ||| ou que l' on constate ||| 0.0344711 1.56906e-11 0.0129267 4.69534e-15 ||| 1-2 2-2 1-3 ||| 3 8 1 ||| |||
>>>> > > ! ' ' ||| ou que l' on ||| 0.00304157 1.56906e-11 0.0129267 1.22594e-10 ||| 1-2 2-2 1-3 ||| 34 8 1 ||| |||
>>>> > > ! ' ' ||| que l' on constate un ||| 0.0344711 1.56906e-11 0.0129267 4.56092e-14 ||| 1-1 2-1 1-2 ||| 3 8 1 ||| |||
>>>> > > ! ' ' ||| que l' on constate ||| 0.00323167 1.56906e-11 0.0129267 4.74661e-12 ||| 1-1 2-1 1-2 ||| 32 8 1 ||| |||
>>>> > >
>>>> > > On 23/09/2015 15:12, Tom Hoar wrote:
>>>> > > > Vincent,
>>>> > > >
>>>> > > > If you suspect bad entries, isn't it better to address the root of the
>>>> > > > problem and prepare your training corpus better?
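One way to read Tom's suggestion, and roughly what the filtering experiment at the top of this message did, is to drop low-scoring sentence pairs from the training corpus before retraining. A minimal Python sketch, assuming the source sentences, target sentences and per-sentence scores are three aligned sequences. The function name `filter_pairs` is hypothetical; the 0.2 default is the cut-off used in that experiment, and how the scores are produced is left open.

```python
from typing import Iterable, Iterator, Tuple


def filter_pairs(
    src_lines: Iterable[str],
    tgt_lines: Iterable[str],
    scores: Iterable[float],
    min_score: float = 0.2,
) -> Iterator[Tuple[str, str]]:
    """Yield the (source, target) pairs whose score is at least min_score.

    Assumes the three inputs are aligned line by line; zip() stops at the
    shortest input, so a length mismatch is not detected here.
    """
    for src, tgt, score in zip(src_lines, tgt_lines, scores):
        if score >= min_score:
            yield src, tgt


if __name__ == "__main__":
    # Toy data, made up for illustration.
    src = ["bonjour", "oui", "merci"]
    tgt = ["hello", "no", "thanks"]
    sentence_scores = [0.6, 0.05, 0.4]
    for pair in filter_pairs(src, tgt, sentence_scores):
        print(pair)
```

As the experiment above suggests, a cleaner corpus by this measure does not guarantee better output at every threshold, so the cut-off itself is worth testing.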
>>>> > > >
>>>> > > > On 9/23/2015 6:46 PM, [email protected] wrote:
>>>> > > > > Date: Tue, 22 Sep 2015 20:24:02 +0200
>>>> > > > > From: Philipp Koehn<[email protected]>
>>>> > > > > Subject: Re: [Moses-support] is there a way to remove a bad entry in
>>>> > > > > the phrase table ?
>>>> > > > > To: Vincent Nguyen<[email protected]>
>>>> > > > > Cc: moses-support<[email protected]>
>>>> > > > >
>>>> > > > > Hi,
>>>> > > > >
>>>> > > > > you can remove it manually (just edit the text file), there will be no
>>>> > > > > negative consequences.
>>>> > > > >
>>>> > > > > However, it is not a realistic strategy to try to remove by hand every
>>>> > > > > offending phrase table entry.
>>>> > > > >
>>>> > > > > -phi
>>>> > > > >
>>>> > > > > On Tue, Sep 22, 2015 at 4:05 PM, Vincent Nguyen<[email protected]> wrote:
>>>> > > > >
>>>> > > > > > >Hi,
>>>> > > > > > >
>>>> > > > > > >I was wondering if after an analysis of the BLEU-Annotation file we
>>>> > > > > > >realize that there must be a bad entry in the phrase table,
>>>> > > > > > >we could remove it manually or in some other ways ?
>>>> > > > > > >
>>>> > > > > > >Thanks.
>>>> > > > > > >V.
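For a handful of entries, the manual removal discussed here can be scripted. Below is a minimal Python sketch, assuming a plain-text phrase table whose fields are separated by `|||` as in the excerpts in this thread, and a list of `source ||| target` pairs in the style of Hieu's example. The helper names (`parse_pair`, `load_blocklist`, `drop_entries`) are hypothetical and not part of Moses; a binarized table would presumably have to be rebuilt from the filtered text afterwards.

```python
from typing import Iterable, Iterator, Set, Tuple

Pair = Tuple[str, str]


def parse_pair(line: str) -> Pair:
    """Return the whitespace-normalized (source, target) fields of a '|||' line."""
    fields = line.split("|||")
    if len(fields) < 2:
        raise ValueError(f"not a 'source ||| target' line: {line!r}")
    return " ".join(fields[0].split()), " ".join(fields[1].split())


def load_blocklist(lines: Iterable[str]) -> Set[Pair]:
    """Build the set of pairs to delete, skipping blank lines."""
    return {parse_pair(line) for line in lines if line.strip()}


def drop_entries(table_lines: Iterable[str], blocked: Set[Pair]) -> Iterator[str]:
    """Yield phrase-table lines whose exact (source, target) pair is not blocked."""
    for line in table_lines:
        if not line.strip():
            continue  # ignore empty lines
        if parse_pair(line) not in blocked:
            yield line


if __name__ == "__main__":
    blocked = load_blocklist(['" 1 ||| One Million Roofs', "", "oui ||| no"])
    table = [
        '" 1 ||| 1 ||| 0.0010136 0.00278649 0.461538 0.805151 ||| 1-0 ||| 11839 26 12 ||| |||',
        '" 1 ||| One Million Roofs ||| 0.103519 0.00213892 0.00398148 3.32314e-15 ||| 0-0 1-0 0-1 0-2 ||| 1 26 1 ||| |||',
    ]
    for kept in drop_entries(table, blocked):
        print(kept)
```

Unlike a loose grep -v pattern, this matches the exact source/target pair, so `" 1 ||| One Million Roofs` is removed while `" 1 ||| 1` is kept.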
>>>> > > > > > >_______________________________________________
>>>> > > > > > >Moses-support mailing list
>>>> > > > > > >[email protected]
>>>> > > > > > >http://mailman.mit.edu/mailman/listinfo/moses-support
>>>> > > >
>>>> > > > --
>>>> > > > Best regards,
>>>> > > >
>>>> > > > Tom Hoar
>>>> > > > Chief Executive Officer
>>>> > > > /*Precision Translation Tools Pte Ltd*/
>>>> > > > Singapore/Thailand
>>>> > > > Web: www.precisiontranslationtools.com
>>>> > > > Thailand Mobile: +66 87 345-1875
>>>> > > > Skype: tahoar
