Hi Tom,
I encountered this problem a few days ago with mosesserver and indeed had to
add a few of your proposed substitutions, so I welcome your suggestions. Also,
there are a few other non-XML substitutions that could be added to the
tokenizer script, such as for fancy curly quotes or apostrophes. A couple that
I had to add are:
#right curly apostrophe
$text =~ s/\N{U+2019}/ \'/g;
#Curly quotes
$text =~ s/\N{U+201C}/ \" /g;
$text =~ s/\N{U+201D}/ \" /g;
Note that I had to use the Unicode notation because the script couldn't find
them otherwise; however, I use rather old and somewhat customized versions of
the scripts so I'm not absolutely sure these are indeed needed.
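As a rough illustration of why the Unicode notation can matter, here is a minimal sketch. It assumes the input arrives as raw UTF-8 bytes that have not been decoded yet. The helper name normalize_quotes is hypothetical and not part of tokenizer.perl, and it drops the extra spaces used in the substitutions above to keep the example short.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper, not part of tokenizer.perl: replaces curly
# quotes and apostrophes with their ASCII equivalents.
sub normalize_quotes {
    my ($text) = @_;
    $text =~ s/\N{U+2019}/'/g;               # right curly apostrophe
    $text =~ s/[\N{U+201C}\N{U+201D}]/"/g;   # curly double quotes
    return $text;
}

# Raw UTF-8 bytes for a short phrase containing U+2019, U+201C and U+201D.
my $bytes = "don\xE2\x80\x99t say \xE2\x80\x9Chi\xE2\x80\x9D";

# On undecoded bytes nothing matches: each curly quote is three separate bytes.
my $untouched = normalize_quotes($bytes);

# After decoding, each curly quote is a single character and the patterns match.
my $fixed = normalize_quotes(decode('UTF-8', $bytes));

print encode('UTF-8', $fixed), "\n";
```

If the script already decodes its input (for example through binmode on STDIN), the decode step is unnecessary, which may be why the substitutions are not needed in every version of the scripts.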
Panos
On Tuesday 29 of May 2012 13:21:43 Tom Hoar wrote:
Updates to tokenizer.perl and detokenizer.perl include escaping characters
that Moses reserves for other uses, such as table delimiters and brackets. Is
there a reason the updates are missing the apostrophe and quote, which are two
of the five XML reserved characters? Not escaping them could affect
communications with mosesserver.
I propose adding these two escape sequences to the tokenizer and detokenizer
scripts.
tokenizer.perl, line 151
#escape special chars
$text =~ s/\&/\&amp;/g; # XML
$text =~ s/\|/\&bar;/g; # moses
$text =~ s/\</\&lt;/g; # XML
$text =~ s/\>/\&gt;/g; # XML
$text =~ s/\[/\&bra;/g; # moses
$text =~ s/\]/\&ket;/g; # moses
$text =~ s/\'/\&apos;/g; # XML
$text =~ s/\"/\&quot;/g; # XML
detokenizer.perl, line 67
# de-escape special chars
$text =~ s/\&apos;/\'/g; # XML
$text =~ s/\&quot;/\"/g; # XML
$text =~ s/\&bar;/\|/g; # moses
$text =~ s/\&lt;/\</g; # XML
$text =~ s/\&gt;/\>/g; # XML
$text =~ s/\&bra;/\[/g; # moses
$text =~ s/\&ket;/\]/g; # moses
$text =~ s/\&amp;/\&/g; # XML
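
The order of the two lists matters: the ampersand is escaped first and de-escaped last, so that text which already looks like an entity survives a round trip. Below is a minimal sketch of that round trip. It assumes the standard XML entity names (&amp;, &lt;, &gt;, &apos;, &quot;) alongside the Moses-specific &bar;, &bra; and &ket;. The sub names escape_special and deescape_special are hypothetical wrappers for illustration, not functions in the scripts.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the proposed tokenizer substitutions.
sub escape_special {
    my ($text) = @_;
    $text =~ s/\&/\&amp;/g;   # first, so the entities added below are not re-escaped
    $text =~ s/\|/\&bar;/g;
    $text =~ s/\</\&lt;/g;
    $text =~ s/\>/\&gt;/g;
    $text =~ s/\[/\&bra;/g;
    $text =~ s/\]/\&ket;/g;
    $text =~ s/\'/\&apos;/g;
    $text =~ s/\"/\&quot;/g;
    return $text;
}

# Hypothetical wrapper around the proposed detokenizer substitutions.
sub deescape_special {
    my ($text) = @_;
    $text =~ s/\&apos;/\'/g;
    $text =~ s/\&quot;/\"/g;
    $text =~ s/\&bar;/\|/g;
    $text =~ s/\&lt;/\</g;
    $text =~ s/\&gt;/\>/g;
    $text =~ s/\&bra;/\[/g;
    $text =~ s/\&ket;/\]/g;
    $text =~ s/\&amp;/\&/g;   # last, so a literal "&lt;" in the input is restored intact
    return $text;
}

# The input deliberately contains a literal "&lt;" to exercise the ordering.
my $in      = q{Tom's <a|b> & "x[1]" &lt;};
my $escaped = escape_special($in);
my $out     = deescape_special($escaped);

print "$escaped\n";
print $out eq $in ? "round trip ok\n" : "round trip FAILED\n";
```

Moving the ampersand rule to a different position in either list would corrupt input that already contains entity-like text such as the trailing "&lt;" in the sample string.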
_______________________________________________
Moses-support mailing list
http://mailman.mit.edu/mailman/listinfo/moses-support