I think defaulting to escaping fulfills the defensive strategy, and a command-line argument to disable the escaping is worth considering.

FYI, the few times I want tokenization without escaping, I simply run the tokenized text through the scripts/tokenizer/deescape-special-chars.perl script. A command-line argument to disable escaping would do the same thing in one pass.
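
To make the two-pass versus one-pass point concrete, here is a minimal Python sketch. It is not the code from tokenizer.perl or deescape-special-chars.perl: the entity table is an assumption based on the &apos; example and the "|" encoding mentioned in this thread, and the function names and the `escape` parameter are hypothetical stand-ins for the proposed option.

```python
# Illustrative mapping only (an assumption); the real Moses scripts may
# cover more or different characters.
ESCAPES = [
    ("&", "&amp;"),    # escaped first so later entities are not double-escaped
    ("|", "&#124;"),   # factor separator
    ("<", "&lt;"),
    (">", "&gt;"),
    ("'", "&apos;"),
    ('"', "&quot;"),
]


def escape_special_chars(text):
    """Replace each special character with its entity."""
    for raw, entity in ESCAPES:
        text = text.replace(raw, entity)
    return text


def deescape_special_chars(text):
    """Undo escape_special_chars; '&amp;' is restored last."""
    for raw, entity in reversed(ESCAPES):
        text = text.replace(entity, raw)
    return text


def tokenize(text, escape=True):
    """Hypothetical stand-in for the tokenizer.

    The input is assumed to be whitespace-tokenized already; only the
    escaping step is modelled. escape=False plays the role of the
    proposed "disable escaping" argument.
    """
    return escape_special_chars(text) if escape else text


line = "Please rise , then , for this minute 's silence ."

# Today: tokenize with escaping, then de-escape in a second pass.
two_pass = deescape_special_chars(tokenize(line))

# Proposed: skip escaping in the first place.
one_pass = tokenize(line, escape=False)
```

Under these assumptions both routes give back the same text; the flag only saves the second pass.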


Best regards,
Tom Hoar
Managing Director
*Precision Translation Tools Co., Ltd.*
Bangkok, Thailand
Web: http://www.precisiontranslationtools.com
Mobile: +66 87 345-1875
Skype: tahoar

On 02/20/2013 01:03 AM, Hieu Hoang wrote:
I agree with you that Moses should only work with text, and that it should be up to the user to strip out XML from the input.

However, escaping XML characters is really a defensive strategy, so that the decoder doesn't choke if the input hasn't been cleaned of XML. I think this has made decoding a bit more reliable.

What else do you suggest?

On 19 February 2013 17:10, Nicholas Ruiz wrote:

    Hi everyone,

    Question/comment/feature request regarding tokenizer.perl.

    Question: Why does tokenizer.perl version 1.1 provide html/xml
    encodings of special characters, such as the apostrophe?
    e.g. Please rise , then , for this minute &apos;s silence .
    Comment: If this is for XML compatibility, couldn't any relevant
    XML markup be annotated with CDATA to ignore parsing? Why should
    this be done within Moses? In my opinion**, Moses should just work
    with text. Otherwise, it's up to the user to decode the text in
    order to use POS taggers, etc., which typically use the same
    tokenization strategies as tokenizer.perl 1.0. (Of course, we
    still need a "|" encoding)

    ** My opinion as a PhD student -- the value of that is left to the
    reader.

    Feature request: There's already a -x flag that skips XML fields.
    What do you think about a flag to enable/disable encodings? (In my
    opinion, it should default to being disabled.)

    Thanks for your time,
    Nick Ruiz
    _______________________________________________
    Moses-support mailing list
    [email protected]
    http://mailman.mit.edu/mailman/listinfo/moses-support




--
Hieu Hoang
Research Associate
University of Edinburgh
http://www.hoang.co.uk

