Re: [Moses-support] Legacy tokenizer.perl functionality.

Tom Hoar Fri, 16 Jan 2015 08:34:24 -0800

"It was a pain..." See, aren't you glad you didn't rely on my Perl skills?

I'm almost sorry I raised the controversy. I forgot Moses started with and got away from versioned script folders. Maybe the responsibility for version tracking is best left to those who critically need it.




On 01/16/2015 11:16 PM, Hieu Hoang wrote:

On 16 January 2015 at 16:13, Barry Haddow wrote:


    Hi

    Yes, the EMS (experiment management system) included with Moses will
    also deal with this by checking timestamps on the tokeniser scripts.

    If you use your models outside the EMS (or Eman etc) however then
    there's no easy way to ensure compatibility between tokeniser and
    model.
    I agree that the tokeniser shouldn't be doing text normalisation,
    but it was, and fixing it could cause more pain than leaving things
    as they are,

oh, i just moved the normalisation to normalize-punctuation.perl. It was a pain

https://github.com/moses-smt/mosesdecoder/commit/19d7c44aad1d1b06b884833cc3a7b0e14d2a6c36
https://github.com/moses-smt/mosesdecoder/commit/30e31d4a95713cd5340941b15d566f09b3b1a2d7
Let's see what happens!


    cheers - Barry

    On 16/01/15 15:07, Ondrej Bojar wrote:
    > Hi, Christian,
    >
    > when the scripts directory of moses was first created back in 2006,
    > we had the same issues with versioning. At that point, I created the
    > (ugly) need to 'install' the scripts, mainly to provide all of them
    > with a version number. Fortunately, we have now got rid of this and
    > the scripts are meant to be used right away after checkout.
    >
    > I'm saying this just to point out that there is probably no ideal
    > way of keeping up to date and yet ensuring compatibility for
    > existing models with toolkits as complex as moses is.
    >
    > For this, I use my eman, an experiment manager where even the moses
    > toolkit itself is something timestamped. So I have a couple of
    > moses checkouts, timestamped, and my models depend on one of them.
    > Moving to a fresher moses checkout is easy (a new timestamped
    > directory gets created), but requires redoing all the models
    > (well, eman does this for me, so it's just a waste of computer space
    > and time, not mine).
    >
    > Cheers, O.
    >
    > ----- Original Message -----
    >> From: "Christian Hardmeier"
    >> To: "Hieu Hoang"
    >> Cc: "Tom Hoar", "moses-support support"
    >> Sent: Friday, 16 January, 2015 15:26:15
    >> Subject: Re: [Moses-support] Legacy tokenizer.perl functionality.
    >> On Jan 16, 2015, at 12:46 PM, Hieu Hoang wrote:
    >>
    >>> i think it's too difficult to police.
    >> You'd probably need a regression test that checks if the tokenised
    >> output is still the same so changes don't go unnoticed. But of
    >> course it's still some extra work.
    >>
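The regression test suggested above could be sketched roughly as follows, in Python rather than Perl. This is a minimal sketch under assumptions: toy_tokenize is a hypothetical stand-in for tokenizer.perl, and INPUTS and GOLDEN are made-up sample data, not files from the Moses repository.

```python
import difflib


def toy_tokenize(line):
    """Hypothetical stand-in for the real tokeniser: splits off . , ! ?"""
    for mark in ".,!?":
        line = line.replace(mark, " " + mark + " ")
    return " ".join(line.split())


def check_against_golden(tokenize, inputs, golden):
    """Return a unified diff; an empty list means the output is unchanged."""
    actual = [tokenize(line) for line in inputs]
    return list(
        difflib.unified_diff(golden, actual, "golden", "actual", lineterm="")
    )


# Made-up sample sentences and the output recorded when the models were built.
INPUTS = ["Hello, world!", "It works."]
GOLDEN = ["Hello , world !", "It works ."]

diff = check_against_golden(toy_tokenize, INPUTS, GOLDEN)
if diff:
    print("\n".join(diff))
```

A non-empty diff would flag a change in tokenised output before it reaches existing models. The extra work mentioned above is mostly in maintaining the golden file per language and per switch combination.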
    >>> Another idea is to get the script to md5 its own source code, and
    >>> the nonbreaking-prefix files it uses.
    >> That would definitely be better than nothing, even though it would
    >> raise false alarms from time to time.
    >>
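A rough illustration of the md5 idea, again in Python. It is only a sketch under assumptions: fingerprint and is_compatible are hypothetical helper names, and which files get hashed is left to the caller.

```python
import hashlib
from pathlib import Path


def fingerprint(paths):
    """MD5 over the names and contents of the given files, in sorted order."""
    digest = hashlib.md5()
    for path in sorted(Path(p) for p in paths):
        digest.update(path.name.encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


def is_compatible(paths, recorded):
    """True if the files still hash to the fingerprint stored with a model."""
    return fingerprint(paths) == recorded
```

The fingerprint would be stored next to the model at training time and checked again before decoding. Any byte-level edit changes the hash, including a comment or whitespace change that leaves the output untouched, which is where the false alarms mentioned above come from.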
    >>> On 16/01/15 11:12, Christian Hardmeier wrote:
    >>>> On Jan 16, 2015, at 11:51 AM, Tom Hoar wrote:
    >>>>
    >>>>> I agree with versioning. Could be added to the command line.
    >>>>>
    >>>>> Also agree that this proposed change qualifies as a version change.
    >>>>>
    >>>>> How do you propose managing the issue of output changes due to
    >>>>> command-line switches, like -no-escape?
    >>>> Very good question. To be consistent, you'd probably have to
    >>>> increment the version number even if the change only applies when
    >>>> you use a certain command-line switch. But not if it doesn't affect
    >>>> the output, and maybe not if you just add a new command-line switch
    >>>> that is off by default. What do you think?
    >>>>
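One possible way to handle the version number and the switches together is to record both with the model, sketched below in Python. Everything here is an assumption for illustration: TOKENIZER_OUTPUT_VERSION, compatibility_key and check_model are hypothetical names, and only -no-escape is a switch mentioned in this thread.

```python
# Hypothetical counter, bumped whenever the tokenised output changes.
TOKENIZER_OUTPUT_VERSION = 3


def compatibility_key(switches, version=TOKENIZER_OUTPUT_VERSION):
    """Combine the output version with the switches that were active."""
    return "v{}:{}".format(version, ",".join(sorted(set(switches))))


def check_model(recorded_key, switches):
    """Raise if the model was built with a different version or switches."""
    current = compatibility_key(switches)
    if current != recorded_key:
        raise ValueError(
            "tokeniser mismatch: model has {!r}, current is {!r}".format(
                recorded_key, current
            )
        )
```

With this layout, a model trained with -no-escape would carry the key "v3:-no-escape", and tokenising without the switch would fail the check even when the version number itself has not moved.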
    >>>>
    >>>>
    >>>>> On 01/16/2015 05:36 PM, Christian Hardmeier wrote:
    >>>>>> I'd like to suggest that there should be a version number in the
    >>>>>> tokeniser that is incremented whenever the output changes, even
    >>>>>> if the change is minor and even if it's just a bugfix. Otherwise
    >>>>>> when you pull a new version of moses you don't know if the output
    >>>>>> of tokenizer.perl is still compatible with your existing models.
    >>>>>> (Moving functionality from tokenizer.perl to
    >>>>>> normalize-punctuation.perl would count as a change from my point
    >>>>>> of view. I don't always use normalize-punctuation.)
    >>>>>>
    >>>>>> /Christian
    >>>>>>
    >>>>>> On Jan 16, 2015, at 10:36 AM, Hieu Hoang wrote:
    >>>>>>
    >>>>>>> it's probably a good idea to make this change. If you've done it
    >>>>>>> already, please send me the updated scripts and I'll check it in.
    >>>>>>> If not, I'll do it myself
    >>>>>>>
    >>>>>>> there's hopefully a fast, C++ tokenizer replacement coming soon.
    >>>>>>> Highlighting these issues now is useful to understanding exactly
    >>>>>>> how the tokenizer works/should work
    >>>>>>>
    >>>>>>> On 15/01/15 01:52, Tom Hoar wrote:
    >>>>>>>> This is a separate issue from the parallel "Tokenization
    >>>>>>>> problem" thread...
    >>>>>>>>
    >>>>>>>> The tokenizer.perl has had one line that transforms the grave
    >>>>>>>> accent (`) to apostrophe and another that transforms double
    >>>>>>>> apostrophe ('') to single quote. I suspect these have been in
    >>>>>>>> the script since the beginning. However, they recently "bit" me
    >>>>>>>> on a project. Easy enough to work around.
    >>>>>>>>
    >>>>>>>> Still, I'm wondering. Do they still belong in the tokenizer.perl
    >>>>>>>> script? Or, should they be moved into one of the other scripts?
    >>>>>>>> The normalize-punctuation.perl script seems to be a good
    >>>>>>>> candidate.
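The separation asked about here could look something like the sketch below, written in Python rather than Perl. It is an illustration under assumptions: the two rules follow the wording of the message above and have not been checked against tokenizer.perl, and normalize_quotes and tokenize are hypothetical names, with a plain whitespace split standing in for real tokenisation.

```python
# Rules as worded in the message above; not checked against tokenizer.perl.
QUOTE_RULES = [
    ("`", "'"),   # grave accent -> apostrophe
    ("''", "'"),  # double apostrophe -> single quote
]


def normalize_quotes(text, rules=QUOTE_RULES):
    """Apply each (old, new) replacement in order."""
    for old, new in rules:
        text = text.replace(old, new)
    return text


def tokenize(text, normalize=False):
    """Hypothetical whitespace tokeniser; normalisation is opt-in."""
    if normalize:
        text = normalize_quotes(text)
    return text.split()
```

Keeping the rewriting in its own function means text containing literal grave accents passes through the tokeniser unchanged unless normalisation is requested.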
    >>>>>>>> _______________________________________________
    >>>>>>>> Moses-support mailing list
    >>>>>>>> [email protected] <mailto:[email protected]>
    >>>>>>>> http://mailman.mit.edu/mailman/listinfo/moses-support


    --
    The University of Edinburgh is a charitable body, registered in
    Scotland, with registration number SC005336.





--
Hieu Hoang
Research Associate
University of Edinburgh
http://www.hoang.co.uk/hieu



