Hi everyone,

I recently ran an experiment through the EMS, evaluating with both nist-bleu and multi-bleu. The latter reported a score about 10 points lower than the former. I think this is due to an error in experiment.meta. According to experiment.meta, the call to multi-bleu is

    multi-bleu.perl reference < recased-output

However, 'reference' is tokenized and lowercased, whereas 'recased-output' is tokenized and cased. It seems that either

a) 'reference' should be replaced by 'tokenized-reference', which is tokenized and cased (giving a case-sensitive BLEU score), or
b) 'recased-output' should be replaced by 'system-output' (or 'cleaned-output'?), which is tokenized and lowercased (giving a case-insensitive BLEU score).

Running these calls by hand gives BLEU scores close to those given by nist-bleu-c and nist-bleu, respectively.

Does this sound right? Should multi-bleu be case-sensitive or case-insensitive (i.e. is (a) or (b) the better fix)? If (b), should it be 'system-output' or 'cleaned-output'?

I posted this to the list a few days ago, but in response to a different message, so I thought I should repost it in case someone who knows the answer overlooked the original. My apologies for the duplicate posting.

Thanks and happy holidays,
Suzy

--
Suzy Howlett
http://www.showlett.id.au/
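
For anyone wanting to see why the case mismatch described above would lower the score, here is a minimal sketch. It is not the code in multi-bleu.perl; it only computes a clipped unigram precision, which is one ingredient of BLEU, and the function names and example sentences are made up for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(hypothesis, reference, n):
    """Clipped n-gram precision of a hypothesis against a single reference.

    Both arguments are whitespace-tokenized strings. Matching is exact,
    so 'The' and 'the' count as different tokens.
    """
    hyp_counts = Counter(ngrams(hypothesis.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return matched / total if total else 0.0


# Hypothetical sentences standing in for the EMS files named in the message.
reference_lc = "the european parliament met in brussels ."       # like 'reference'
reference_cased = "The European Parliament met in Brussels ."    # like 'tokenized-reference'
hypothesis_cased = "The European Parliament met in Brussels ."   # like 'recased-output'

# Current call: cased output scored against a lowercased reference.
mismatched = clipped_precision(hypothesis_cased, reference_lc, 1)

# Fix (a): cased output against cased reference.
cased = clipped_precision(hypothesis_cased, reference_cased, 1)

# Fix (b): lowercased output against lowercased reference.
insensitive = clipped_precision(hypothesis_cased.lower(), reference_lc, 1)

print(mismatched, cased, insensitive)
```

In this toy case every capitalized token fails to match in the mismatched setup even though the translation is perfect, while both (a) and (b) compare like with like. The size of the drop on real data depends on how many tokens carry capitals.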
