Hi Tom,

Thanks for your suggestions. I have added the escaping for quotes.

Panos: the things you are suggesting are mostly text normalization.
We provided a script for some of these as part of the WMT
evaluation campaign, which includes the example you give.
But these substitutions are not needed to keep the decoder from
tripping up, so I'd rather not include them in the tokenizer.
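
To illustrate the distinction, below is a minimal sketch of the escaping
round trip only. The sub names escape_special and unescape_special are
made up for this example, and the rules simply mirror the ones proposed
in this thread rather than the exact code in tokenizer.perl and
detokenizer.perl. The one thing it is meant to show is the ordering:
the ampersand is escaped first and de-escaped last.

```perl
use strict;
use warnings;

# Hypothetical helper: escape the characters discussed in this thread.
sub escape_special {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;     # XML; must run first so later entities are not re-escaped
    $text =~ s/\|/&bar;/g;    # moses
    $text =~ s/</&lt;/g;      # XML
    $text =~ s/>/&gt;/g;      # XML
    $text =~ s/\[/&bra;/g;    # moses
    $text =~ s/\]/&ket;/g;    # moses
    $text =~ s/'/&apos;/g;    # XML
    $text =~ s/"/&quot;/g;    # XML
    return $text;
}

# Hypothetical helper: reverse the escaping above.
sub unescape_special {
    my ($text) = @_;
    $text =~ s/&apos;/'/g;
    $text =~ s/&quot;/"/g;
    $text =~ s/&bar;/|/g;
    $text =~ s/&lt;/</g;
    $text =~ s/&gt;/>/g;
    $text =~ s/&bra;/[/g;
    $text =~ s/&ket;/]/g;
    $text =~ s/&amp;/&/g;     # must run last, otherwise a literal "&lt;" would become "<"
    return $text;
}

# The input deliberately contains a literal "&lt;" to exercise the ordering.
my $in      = q{AT&T said "don't" [a|b] <c> &lt;};
my $escaped = escape_special($in);
print "$escaped\n";
print unescape_special($escaped) eq $in ? "round trip ok\n" : "round trip FAILED\n";
```

Curly quotes and similar characters pass through both subs unchanged,
which is why I would treat them as a separate normalization step.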

-phi

On Tue, May 29, 2012 at 1:02 AM, Panos Kanavos <[email protected]> wrote:
> Hi Tom,
>
>
>
> I encountered this problem a few days ago with mosesserver and indeed I had
> to add a few of your proposed substitutions, so I welcome your suggestions.
> Also, there are a few other non-XML substitutions that could be added to the
> tokenizer script, such as for fancy curly quotes or apostrophes. A couple
> that I had to add are:
>
>
>
> #right curly apostrophe
>
> $text =~ s/\N{U+2019}/ \'/g;
>
>
>
> #Curly quotes
>
> $text =~ s/\N{U+201C}/ \" /g;
>
> $text =~ s/\N{U+201D}/ \" /g;
>
>
>
> Note that I had to use the Unicode notation because the script couldn't find
> them otherwise; however, I use rather old and somewhat customized versions
> of the scripts so I'm not absolutely sure these are indeed needed.
>
>
>
> Panos
>
>
>
> On Tuesday 29 of May 2012 13:21:43 Tom Hoar wrote:
>
> Updates to tokenizer.perl and detokenizer.perl include escaping characters
> that moses reserves for other use, such as table delimiters and brackets. Is
> there a reason the updates are missing the apostrophe and quote, which are
> two of the five XML reserved characters? Not escaping them could affect
> communications with mosesserver.
>
> I propose adding these two escape sequences to the tokenizer and detokenizer
> scripts.
>
> tokenizer.perl, line 151
>   #escape special chars
>   $text =~ s/\&/\&amp;/g; # XML
>   $text =~ s/\|/\&bar;/g;   # moses
>   $text =~ s/\</\&lt;/g;     # XML
>   $text =~ s/\>/\&gt;/g;    # XML
>   $text =~ s/\[/\&bra;/g;   # moses
>   $text =~ s/\]/\&ket;/g;   # moses
>   $text =~ s/\'/\&apos;/g;  # XML
>   $text =~ s/\"/\&quot;/g;  # XML
>
> detokenizer.perl, line 67
>   # de-escape special chars
>   $text =~ s/\&apos;/\'/g;  # XML
>   $text =~ s/\&quot;/\"/g;  # XML
>   $text =~ s/\&bar;/\|/g;    # moses
>   $text =~ s/\&lt;/\</g;     # XML
>   $text =~ s/\&gt;/\>/g;    # XML
>   $text =~ s/\&bra;/\[/g;    # moses
>   $text =~ s/\&ket;/\]/g;    # moses
>   $text =~ s/\&amp;/\&/g;  # XML
>
>
>
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
