Hi John,

Thanks for pointing out the issue. I added support for arbitrary encodings to the script. The default is UTF-8, but you can change the global variable on line 23 to use another encoding. Just update the file from SVN.
Treating non-ASCII characters as separate tokens by wrapping them in spaces is, as far as I understand, not the right thing to do in the general case.

Best,
Mark

On Mon, Nov 29, 2010 at 12:34 AM, John Morgan <[email protected]> wrote:
> Hi,
> I'd like to use the script
> bootstrap-hypothesis-difference-significance.pl
> to compare 2 systems that translate from English into languages that
> use non-ascii character encodings.
> I think this script is written for English hypothesis and reference files.
> I guess that an option similar to the -e option to mteval needs to be
> added to the script to make it work for non-ascii files.
> I added the following line to the script at line 240 after the "while"
> statement slurps in a line from the opened file:
> s/([^[:ascii:]])/ $1 /g
> It looks like this is all the -e option to mteval does.
> I have 2 questions:
> Is this the correct way to get the bootstrap script to work on
> non-ascii text files?
> If yes, can anyone explain to me why?
> Why do we need to wrap white space around nonascii characters?
>
> When I do this the BLEU scores look reasonable (but I could be fooling
> myself).
>
>
> --
> Regards,
> John J Morgan
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
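
To make the effect of the substitution quoted in this thread concrete, here is a minimal sketch in Python (not the Perl used by the script itself). It assumes the input has already been decoded into Unicode strings and that tokens are obtained by splitting on whitespace. The names `wrap_non_ascii` and `tokenize` are hypothetical helpers for illustration only and are not taken from the bootstrap script or from mteval.

```python
import re

# Rough Python analogue of the Perl substitution s/([^[:ascii:]])/ $1 /g,
# assuming the line is a decoded Unicode string rather than raw bytes.
NON_ASCII = re.compile(r"([^\x00-\x7f])")


def wrap_non_ascii(line):
    """Surround every non-ASCII character with a space on each side."""
    return NON_ASCII.sub(r" \1 ", line)


def tokenize(line):
    """Whitespace tokenization, the assumed basis for n-gram counting."""
    return line.split()


if __name__ == "__main__":
    for sample in ["das Mädchen", "我爱你"]:
        print(tokenize(sample))                  # tokens without the substitution
        print(tokenize(wrap_non_ascii(sample)))  # tokens after the substitution
```

Under these assumptions, a string of Chinese characters is split into one token per character, which may be a sensible unit for scoring unsegmented text. A German word such as "Mädchen", however, is broken into "M", "ä" and "dchen", so n-gram matches would no longer be counted over whole words. This is one way to see why the substitution may not be appropriate in the general case.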
