Hi everyone,

I recently ran an experiment through the EMS, evaluating with both 
nist-bleu and multi-bleu. The latter reported a score about 10 points 
lower than the former. I think this is due to an error in 
experiment.meta. According to experiment.meta, the call to multi-bleu is

multi-bleu.perl reference < recased-output

However, 'reference' is tokenized and lowercased, whereas 
'recased-output' is tokenized and cased. It seems that either

a) 'reference' should be replaced by 'tokenized-reference', which is 
tokenized and cased (getting a cased BLEU score), or
b) 'recased-output' should be replaced by 'system-output' (or 
'cleaned-output'?), which is tokenized and lowercased (getting a 
case-insensitive BLEU score).

Running these calls by hand gives BLEU scores close to those given by 
nist-bleu-c and nist-bleu, respectively.
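
To illustrate why a case mismatch alone can cost this much, here is a 
minimal Python sketch of clipped n-gram precision, the quantity that BLEU 
is built from. It is not multi-bleu.perl itself: the ngram_precision 
helper and the example sentence are made up for illustration, and brevity 
penalty and the combination of n-gram orders are left out.

```python
from collections import Counter

def ngram_precision(hyp, ref, n):
    """Clipped n-gram precision of hyp against a single ref (token lists)."""
    hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    # Each hypothesis n-gram counts at most as often as it occurs in the reference.
    matched = sum(min(count, ref_ngrams[gram]) for gram, count in hyp_ngrams.items())
    total = sum(hyp_ngrams.values())
    return matched / total if total else 0.0

# Hypothetical data: a lowercased reference and a cased system output.
reference_lc = "the european parliament met in brussels on monday".split()
output_cased = "The European Parliament met in Brussels on Monday".split()
output_lc = [token.lower() for token in output_cased]

# Cased output against lowercased reference: capitalised tokens never match.
mismatched = ngram_precision(output_cased, reference_lc, 1)
# Lowercased output against lowercased reference: casing is consistent.
matched = ngram_precision(output_lc, reference_lc, 1)

print(f"unigram precision, cased vs lowercased:      {mismatched:.3f}")
print(f"unigram precision, lowercased vs lowercased: {matched:.3f}")
```

The translation is identical in both comparisons; only the casing of one 
side differs, so any gap between the two numbers comes purely from the 
mismatch. Higher-order n-grams are hit harder, since one capitalised 
token breaks every n-gram that contains it.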

Does this sound right? Should multi-bleu be case-sensitive or 
case-insensitive (i.e. is (a) or (b) the better fix)? If (b), should it be 
'system-output' or 'cleaned-output'?

I posted this to the list a few days ago, but in response to a different 
message, so I thought I should repost it in case someone who knows the 
answer overlooked the original. My apologies for the duplicate posting.

Thanks and happy holidays,
Suzy

-- 
Suzy Howlett
http://www.showlett.id.au/
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support