Hi John,

Thanks for pointing out the issue. I have added support for arbitrary
encodings to the script. The default is UTF-8, but you can change the
global variable on line 23 to use another encoding. Just update the
file from SVN.

As far as I understand, treating non-ASCII characters as separate
tokens by wrapping them in spaces is not the right thing to do in the
general case.
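
To illustrate, here is a rough sketch in Python rather than the script's
Perl. The function names are made up for this example and are not part
of the bootstrap script or mteval, and it assumes the input has already
been decoded to Unicode characters. The substitution splits accented
words apart, while for text written without spaces it simply yields one
token per character:

```python
import re

# Python counterpart of the Perl substitution s/([^[:ascii:]])/ $1 /g,
# assuming the line holds decoded characters rather than raw bytes.
NON_ASCII = re.compile(r"([^\x00-\x7F])")


def wrap_non_ascii(line):
    """Surround every non-ASCII character with spaces."""
    return NON_ASCII.sub(r" \1 ", line)


def tokens(line):
    """Whitespace-tokenize a line after wrapping non-ASCII characters."""
    return wrap_non_ascii(line).split()


# French: the accented letters become tokens of their own,
# so "caf\u00e9" is no longer scored as a single word.
print(tokens("le caf\u00e9 est ferm\u00e9"))

# Chinese: every character becomes its own token.
print(tokens("\u6211\u559c\u6b22\u732b"))
```

For the French line, the n-grams are counted over word fragments
instead of words, which is probably not what you want when comparing
systems.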

Best,
Mark

On Mon, Nov 29, 2010 at 12:34 AM, John Morgan
<[email protected]> wrote:
> Hi,
> I'd like to use the script
> bootstrap-hypothesis-difference-significance.pl
> to compare 2 systems that translate from English into languages that
> use non-ascii character encodings.
> I think this script is written for English hypothesis and reference files.
> I guess that an option similar to the -e option to mteval needs to be
> added to the script to make it work for non-ascii files.
> I added the following line to the script at line 240 after the "while"
> statement slurps in a line from the opened file:
> s/([^[:ascii:]])/ $1 /g
> It looks like this is all the -e option to mteval does.
> I have 2 questions:
> Is this the correct way to get the bootstrap script to work on
> non-ascii text files?
> If yes, can anyone explain to me why?
> Why do we need to wrap white space around nonascii characters?
>
> When I do this the BLEU scores look reasonable (but I could be fooling
> myself).
>
>
> --
> Regards,
> John J Morgan
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
>