Hi Tom,

I ran into this problem a few days ago with mosesserver and did have to add 
a few of your proposed substitutions, so I welcome your suggestions. There 
are also a few other non-XML substitutions that could be added to the 
tokenizer script, such as for curly quotes or apostrophes. A couple that I 
had to add are:

#right curly apostrophe
$text =~ s/\N{U+2019}/ \'/g;

#Curly quotes
$text =~ s/\N{U+201C}/ \" /g;
$text =~ s/\N{U+201D}/ \" /g;

Note that I had to use the Unicode notation because the script couldn't find 
them otherwise; however, I use rather old and somewhat customized versions of 
the scripts so I'm not absolutely sure these are indeed needed.
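
One possible explanation, as a minimal sketch: the \N{U+....} patterns match 
code points, so they only find anything once the input has been decoded from 
UTF-8. On raw bytes, U+2019 is three separate bytes and never matches. The 
sub name normalize_quotes and the sample string are hypothetical and are not 
part of tokenizer.perl.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (not part of tokenizer.perl): applies the
# curly-quote substitutions shown above.
sub normalize_quotes {
    my ($text) = @_;
    $text =~ s/\N{U+2019}/ \'/g;   # right curly apostrophe
    $text =~ s/\N{U+201C}/ \" /g;  # left curly double quote
    $text =~ s/\N{U+201D}/ \" /g;  # right curly double quote
    return $text;
}

# Raw UTF-8 bytes for "it’s", as read from a filehandle that has no
# encoding layer.
my $bytes = "it\xE2\x80\x99s";

# On undecoded bytes the pattern has nothing to match.
my $undecoded = normalize_quotes($bytes);

# After decoding, U+2019 is a single character and the pattern matches.
my $decoded = normalize_quotes(decode('UTF-8', $bytes));

print encode('UTF-8', $decoded), "\n";
```

If the script already reads its input through a UTF-8 layer, the 
substitutions should apply as written.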

Panos

On Tuesday 29 of May 2012 13:21:43 Tom Hoar wrote:

Updates to tokenizer.perl and detokenizer.perl include escaping characters 
that Moses reserves for other uses, such as table delimiters and brackets. Is 
there a reason the updates are missing the apostrophe and quote, which are two 
of the five XML reserved characters? Not escaping them could affect 
communications with mosesserver.
I propose adding these two escape sequences to the tokenizer and detokenizer 
scripts.
tokenizer.perl, line 151
  #escape special chars
  $text =~ s/\&/\&amp;/g;   # XML
  $text =~ s/\|/\&bar;/g;   # moses
  $text =~ s/\</\&lt;/g;     # XML
  $text =~ s/\>/\&gt;/g;    # XML
  $text =~ s/\[/\&bra;/g;   # moses
  $text =~ s/\]/\&ket;/g;   # moses
  $text =~ s/\'/\&apos;/g;  # XML
  $text =~ s/\"/\&quot;/g;  # XML

detokenizer.perl, line 67
  # de-escape special chars
  $text =~ s/\&apos;/\'/g;  # XML
  $text =~ s/\&quot;/\"/g;  # XML
  $text =~ s/\&bar;/\|/g;    # moses
  $text =~ s/\&lt;/\</g;     # XML
  $text =~ s/\&gt;/\>/g;    # XML
  $text =~ s/\&bra;/\[/g;    # moses
  $text =~ s/\&ket;/\]/g;    # moses
  $text =~ s/\&amp;/\&/g;  # XML
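
The ordering in the two lists matters: the ampersand has to be escaped first 
and de-escaped last, otherwise entities get double-processed. A minimal 
round-trip sketch, assuming the substitutions are wrapped in two subs 
(escape_special and deescape_special are hypothetical names, as is the sample 
string):

```perl
use strict;
use warnings;

# Hypothetical wrapper around the tokenizer-side substitutions.
sub escape_special {
    my ($text) = @_;
    $text =~ s/\&/\&amp;/g;   # first, so the entities added below stay intact
    $text =~ s/\|/\&bar;/g;   # moses
    $text =~ s/\</\&lt;/g;    # XML
    $text =~ s/\>/\&gt;/g;    # XML
    $text =~ s/\[/\&bra;/g;   # moses
    $text =~ s/\]/\&ket;/g;   # moses
    $text =~ s/\'/\&apos;/g;  # XML
    $text =~ s/\"/\&quot;/g;  # XML
    return $text;
}

# Hypothetical wrapper around the detokenizer-side substitutions.
sub deescape_special {
    my ($text) = @_;
    $text =~ s/\&apos;/\'/g;  # XML
    $text =~ s/\&quot;/\"/g;  # XML
    $text =~ s/\&bar;/\|/g;   # moses
    $text =~ s/\&lt;/\</g;    # XML
    $text =~ s/\&gt;/\>/g;    # XML
    $text =~ s/\&bra;/\[/g;   # moses
    $text =~ s/\&ket;/\]/g;   # moses
    $text =~ s/\&amp;/\&/g;   # last, so a literal "&lt;" in the input survives
    return $text;
}

# Sample input containing every reserved character plus a literal "&lt;".
my $in      = q{AT&T's "a|b" <x> [y] &lt;};
my $escaped = escape_special($in);
my $out     = deescape_special($escaped);

print "$escaped\n";
print $out eq $in ? "round trip ok\n" : "round trip FAILED\n";
```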



