Hi Philipp,
Thanks for the reply. I tracked some of the cases down to a *known* word
(or whitespace-tokenized thingie, anyway -- I don't know much about what
constitutes a word in written Chinese) by doing the following:
----------------------------------------------------------------------
$ echo "说了算" | moses_chart -f moses.ini -cube-pruning-pop-limit 2000
Translating: <s> 说了算 </s> ||| [0,0]=X (1) [0,1]=X (1) [0,2]=X (1) [1,1]=X (1) [1,2]=X (1) [2,2]=X (1)
Num of hypo = 1591 --- cells:
0 1 2
1 3 1587
0 0
0
NO BEST TRANSLATION
----------------------------------------------------------------------
(An aside: 1587 is the number of categories in the unknown-word list. Why
does the last token, viz., "</s>", get that many chart entries?)
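For what it's worth, here is how I'm reading that "cells" triangle -- an assumption inferred from the output shape, not checked against the Moses source: row r, column c holds the hypothesis count for the span of length r+1 starting at token c. A tiny sketch with the numbers from the run above:

```python
# My reading of the "cells" triangle (assumption from the output shape,
# not from the Moses source): row r, column c = hypothesis count for the
# span of length r+1 starting at token c.
tokens = ["<s>", "说了算", "</s>"]
cells = [
    [1, 3, 1587],  # length-1 spans: <s>, 说了算, </s>
    [0, 0],        # length-2 spans -- nothing combines
    [0],           # the full span is empty, hence NO BEST TRANSLATION
]
# the triangle's total matches the reported "Num of hypo = 1591"
assert sum(n for row in cells for n in row) == 1591
for r, row in enumerate(cells):
    for c, n in enumerate(row):
        print(" ".join(tokens[c:c + r + 1]), "->", n)
```

Under that reading, the middle token really does get exactly 3 hypotheses (the three rule-table entries below), and it's the length-2 spans that come up empty.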
Anyhow, sure enough, there are three entries for the middle token "说了算":
----------------------------------------------------------------------
$ zless rule-table.gz
...
说了算 [X] ||| is [((S\NP[expl])/(S[to]\NP))/(S[adj]\NP)] ||| 0.000113126 6.94e-05 0.00475133 0.5 2.718 ||| ||| 126 3
说了算 [X] ||| is necessary [(S\NP[expl])/(S[to]\NP)] ||| 0.000309866 6.94e-05 0.00475133 0.00028945 2.718 ||| ||| 46 3
说了算 [X] ||| is necessary to [(S\NP[expl])/(S[b]\NP)] ||| 0.000208847 6.94e-05 0.00475133 1.07891e-05 2.718 ||| ||| 68.25 3
...
----------------------------------------------------------------------
There are entries in the glue table for these three categories --
((S\NP[expl])/(S[to]\NP))/(S[adj]\NP), (S\NP[expl])/(S[to]\NP) and
(S\NP[expl])/(S[b]\NP) -- so we should be able to hack together a
translation using any of them:
----------------------------------------------------------------------
<s> [X] ||| <s> [Q] ||| 1 |||
...
[X][Q] [X][((S\NP[expl])/(S[to]\NP))/(S[adj]\NP)] [X] ||| [X][Q] [X][((S\NP[expl])/(S[to]\NP))/(S[adj]\NP)] [Q] ||| 2.718 ||| 0-0 1-1
...
[X][Q] [X][(S\NP[expl])/(S[to]\NP)] [X] ||| [X][Q] [X][(S\NP[expl])/(S[to]\NP)] [Q] ||| 2.718 ||| 0-0 1-1
...
[X][Q] [X][(S\NP[expl])/(S[b]\NP)] [X] ||| [X][Q] [X][(S\NP[expl])/(S[b]\NP)] [Q] ||| 2.718 ||| 0-0 1-1
...
----------------------------------------------------------------------
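To convince myself the category strings really do match character-for-character between the two tables, I sketched a quick cross-check. The field layout is inferred from the excerpts above, and the quoted lines are inlined so the sketch is self-contained; against the real files one would read rule-table.gz (gzip.open) and the glue-grammar file instead:

```python
# Sanity check: every target-side LHS in the rule table should occur
# (as an exact substring) in some glue rule. Field layout inferred from
# the excerpts quoted above; lines inlined for self-containment.
rule_lines = [
    r"说了算 [X] ||| is [((S\NP[expl])/(S[to]\NP))/(S[adj]\NP)] ||| 0.000113126 6.94e-05 0.00475133 0.5 2.718 ||| ||| 126 3",
    r"说了算 [X] ||| is necessary [(S\NP[expl])/(S[to]\NP)] ||| 0.000309866 6.94e-05 0.00475133 0.00028945 2.718 ||| ||| 46 3",
    r"说了算 [X] ||| is necessary to [(S\NP[expl])/(S[b]\NP)] ||| 0.000208847 6.94e-05 0.00475133 1.07891e-05 2.718 ||| ||| 68.25 3",
]
glue_lines = [
    r"[X][Q] [X][((S\NP[expl])/(S[to]\NP))/(S[adj]\NP)] [X] ||| [X][Q] [X][((S\NP[expl])/(S[to]\NP))/(S[adj]\NP)] [Q] ||| 2.718 ||| 0-0 1-1",
    r"[X][Q] [X][(S\NP[expl])/(S[to]\NP)] [X] ||| [X][Q] [X][(S\NP[expl])/(S[to]\NP)] [Q] ||| 2.718 ||| 0-0 1-1",
    r"[X][Q] [X][(S\NP[expl])/(S[b]\NP)] [X] ||| [X][Q] [X][(S\NP[expl])/(S[b]\NP)] [Q] ||| 2.718 ||| 0-0 1-1",
]

def target_lhs(rule_line):
    # target side is the second |||-delimited field; its final
    # space-delimited token is the bracketed LHS category
    target = rule_line.split("|||")[1].strip()
    return target.rsplit(" ", 1)[1]

missing = [cat for cat in map(target_lhs, rule_lines)
           if not any(cat in g for g in glue_lines)]
print("categories without a glue rule:", missing)  # -> []
```

For these three entries the check comes up empty, which is exactly why I'm puzzled that no translation is produced.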
And just to be sure that it isn't an unknown word problem, let's mangle the
token "说了算" by deleting the last character and see what happens:
----------------------------------------------------------------------
$ echo "说了" | ../moses/bin/moses-chart-19-june-2011 -f dev-test/ZhEn/mert/run1.moses.ini -cube-pruning-pop-limit 2000
Translating: <s> 说了 </s> ||| [0,0]=X (1) [0,1]=X (1) [0,2]=X (1) [1,1]=X (1) [1,2]=X (1) [2,2]=X (1)
Num of hypo = 6396 --- cells:
0 1 2
1 1587 1587
1 0
1
BEST TRANSLATION: 4763 Q </s> :0-0 : pC=0.000, c=-1.002 [0..2] 3176 [total=-22.789] <<-1.303, -1.940, -46.302, 0.000, 0.000, 0.000, 0.000, 0.000, 1.000>>
说了
----------------------------------------------------------------------
The best "translation" is just a pass-through, as expected (and there are
1587 nodes for that unknown token -- exactly as many as there are
unknown-word LHSs in the unknown-lhs file).
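(For reference, the unknown-lhs file was built as described in my first
message below: for each category, the count of rule-table entries where it
spans exactly one source word, divided by the total number of single-word
entries. A toy sketch of that computation -- the sample lines are inlined,
and the two-word entry is invented purely to show what gets skipped:)

```python
# Sketch of the unknown-lhs relative-frequency computation: count
# single-source-word rule-table entries per target-side LHS, normalize.
from collections import Counter

rule_lines = [
    r"说了算 [X] ||| is necessary [(S\NP[expl])/(S[to]\NP)] ||| 0.0003",
    r"说了算 [X] ||| is necessary to [(S\NP[expl])/(S[b]\NP)] ||| 0.0002",
    r"说了 算 [X] ||| he said count [(S\NP)] ||| 0.0001",  # invented two-word entry: skipped
]

counts = Counter()
for line in rule_lines:
    fields = [f.strip() for f in line.split("|||")]
    source, target = fields[0].split(), fields[1].split()
    if len(source) == 2:              # one terminal plus the [X] marker
        counts[target[-1]] += 1       # target-side LHS category

total = sum(counts.values())
rel_freq = {cat: n / total for cat, n in counts.items()}
print(rel_freq)  # each single-word category gets 0.5 here
```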
Strange. Very strange. Or am I missing something obvious? Does anyone
have a guess as to what's going on here?
--D.N.
2011/6/22 Philipp Koehn <[email protected]>
> Hi,
>
> there should always be a rule to combine a span to the left.
>
> Check what labels are chosen for the 13th word, and why there
> are no glue rules for it.
>
> If I had to hazard a guess, I would suspect that this is an
> unknown word and a file with the likely labels for unknown words
> is used, but these do not match the glue grammar.
>
> -phi
>
> 2011/6/22 Dennis Mehay <[email protected]>:
> > Hi all,
> >
> > I posted this, but it bounced. My attachments were too big. I'm
> > resending without the larger attachment. Apologies for any duplicate
> > posting.
> >
> > I'm running moses_chart to do some syntax-based MT experiments, and,
> > during tuning, I'm coming across some instances where the decoder can't
> > produce a translation (between 32 and 38 sentences in a 500-sentence
> > tuning set). This should not be happening, so far as I can tell, since
> > I have a glue grammar (where all the nonterminals of the training set
> > plus the [Q] nonterminal are accounted for), and an 'unknown-lhs' list
> > with the relative frequencies of all the categories as they span only a
> > single word in the training set (i.e., the frequency of each category's
> > spanning a single word in the rule table / the total number of
> > single-word instances in the rule table).
> >
> > Here is an example of a sentence that there was no translation for:
> >
> > ---------------------------------------------------------------------------------------
> > Translating: <s> 没有 规划 作 指导 , 就 可能 出现 谁 有 权 谁 说了算 , 谁 官 大 谁 说了算 . </s>
> > ...
> > Decoding:
> > Num of hypo = 84813 --- cells:
> > 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
> > 1 100 77 93 83 99 99 100 100 85 99 43 85 3 99 85 18 100 85 3 14 1000
> > 40 960 278 717 916 857 976 276 396 952 958 150 0 0 919 74 402 802 0 0 12
> > 200 975 908 849 850 858 968 971 971 862 974 0 0 0 852 865 984 0 0 0
> > 200 940 849 889 763 715 990 962 979 905 0 0 0 0 864 984 0 0 0
> > 200 868 939 886 863 803 887 861 981 0 0 0 0 0 871 0 0 0
> > 200 828 910 801 838 796 722 870 0 0 0 0 0 0 0 0 0
> > 200 799 914 832 801 745 926 0 0 0 0 0 0 0 0 0
> > 200 756 819 901 693 692 0 0 0 0 0 0 0 0 0
> > 200 716 680 665 437 0 0 0 0 0 0 0 0 0
> > 200 683 527 929 0 0 0 0 0 0 0 0 0
> > 200 532 588 0 0 0 0 0 0 0 0 0
> > 200 580 0 0 0 0 0 0 0 0 0
> > 200 0 0 0 0 0 0 0 0 0
> > 0 0 0 0 0 0 0 0 0
> > 0 0 0 0 0 0 0 0
> > 0 0 0 0 0 0 0
> > 0 0 0 0 0 0
> > 0 0 0 0 0
> > 0 0 0 0
> > 0 0 0
> > 0 0
> > 0
> > NO BEST TRANSLATION
> >
> > Translation took 4.340 seconds
> >
> > ---------------------------------------------------------------------------------------
> >
> > The ASCII-art chart's alignment may be a bit off, but, just eye-balling
> > it, it looks as if the 19th word (index 18) has a chart entry count
> > above it, but then this entry does not get combined with what's to the
> > left using the glue rules.
> >
> > Could this be a pruning or cutoff issue (i.e., stack size,
> > cube-pruning-pop-limit, maximum number of rules per span, etc.)? Or
> > maybe it has to do with the fact that my unknown-lhs file has *all*
> > categories that spanned a single word in the training set. Maybe I
> > should prune it to the top 10 or 20, or so. I'm really at a loss here.
> > I thought the glue grammar would make the decoder always return an
> > answer, no matter how awful.
> >
> > Any insight?
> >
> > I have attached my moses.ini file in case anyone wants to have a look.
> > I can also send the glue rule file later, but, as I said, it seems to
> > account for all of the training set's categories (and it was produced
> > automatically using the -glue-grammar option).
> >
> > Best,
> > Dennis
>
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support