Hello all,

I'm interested in using the extended output search graph (osgx) output from
Moses.

First, I have a patch you might be interested in.  When I printed out a few
toy examples, I noticed that there was no mention of the input coverage
of the output (as there *is* in the osg format), so I made a little patch
that fixes that.

Here's the diff:

--- mosesdecoder/trunk/moses/src/Manager.cpp    2011-01-18 22:43:58.000000000 -0500
+++ Manager.cpp 2011-01-18 22:59:11.000000000 -0500
@@ -568,6 +568,10 @@
        StaticData::Instance().GetScoreIndexManager().PrintLabeledScores( outputSearchGraphStream, scoreBreakdown );
        outputSearchGraphStream << " ]";

+       // added this so that we will have the span in the input covered (why wasn't this in the extended format?)
+       // (DNM, 19 Nov 2010)
+       outputSearchGraphStream << " covered=" << searchNode.hypo->GetCurrSourceWordsRange().GetStartPos()
+                               << "-" << searchNode.hypo->GetCurrSourceWordsRange().GetEndPos();
        outputSearchGraphStream << " out=" << searchNode.hypo->GetCurrTargetPhrase().GetStringRep(outputFactorOrder) << endl;
 }

That seems to do it.  You can of course omit my snide remarks and my
initials from the patch, should you choose to use it.

Also, I had a question. When toying around with the (patched) osgx output, I
see that, ostensibly, all of the model component scores are mentioned.  I
wonder exactly what is being scored, though.  First off, are these scores
(when appropriate, e.g., the lm scores) based on what came "before" -- i.e.,
on the content of the nodes that these nodes point back to?  Whether they
are or not, I get strange results on a toy example I cooked up.

Using the 197 sentence pairs in the Europarl de-en corpus that meet the
standard 80-word maximum cutoff (with aggressive tokenization of the German,
but not of the English), I trained up a little model.  Translating the sentence
"das ist nicht schlecht ." (a silly sentence that I could, with my limited
German, compose using the limited resources of the toy phrase table), gives
an osgx file with the following entries in it (among others):

...
0 hyp=1 back=0 [ d: 0.000 w: -1.000 u: 0.000 d: -0.511 0.000 0.000 0.000 0.000 0.000 lm: -4.802 -100.000 tm: -2.398 0.000 -5.011 0.000 1.000 ] covered=0-0 out=that
0 hyp=6 back=0 [ d: 0.000 w: -1.000 u: 0.000 d: -1.609 0.000 0.000 0.000 0.000 0.000 lm: -4.627 -100.000 tm: -1.099 -5.088 -5.011 0.000 1.000 ] covered=0-0 out=this
...

So far, so good.  These two hypotheses translate the span 0-0 (i.e., "das"),
and they are at the beginning of the English output sentence (back=0, i.e.,
they point back to the initial, empty hypothesis).  So, presumably, the
first lm score (a word-based lm) should be a score over either "<s> that"
(resp. "<s> this"), if this is a score based on the prior hypothesis that it
points back to, or "that" (resp. "this"), if not.

But looking in the toy lm file, we see that:

-2.001529       that    -0.3822374
...
-2.162679       this    -0.3372842
...
-2.085553       <s> that        -0.1508171
...
-2.009406       <s> this        -0.01284565

none of which jibes with what we see for the first of the two lm component
scores in the osgx file.

Does anyone know the gory details of the osg(x) file output enough to
advise?

Best,
D.N. ("Dennis")
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support