"It was a pain..." See, aren't you glad you didn't rely on my Perl skills?
I'm almost sorry I raised the controversy. I forgot that Moses started with
versioned script folders and later got away from them. Maybe the responsibility
for version tracking is best left to those who critically need it.
On 01/16/2015 11:16 PM, Hieu Hoang wrote:
On 16 January 2015 at 16:13, Barry Haddow <[email protected]> wrote:
Hi
Yes, the EMS (experiment management system) included with Moses will
also deal with this by checking timestamps on the tokeniser scripts.
If you use your models outside the EMS (or Eman etc) however then
there's no easy way to ensure compatibility between tokeniser and
model.
I agree that the tokeniser shouldn't be doing text normalisation, but it
was, and fixing it could cause more pain than leaving things as they are,
oh, i just moved the normalisation to normalize-punctuation.perl. It was a pain
https://github.com/moses-smt/mosesdecoder/commit/19d7c44aad1d1b06b884833cc3a7b0e14d2a6c36
https://github.com/moses-smt/mosesdecoder/commit/30e31d4a95713cd5340941b15d566f09b3b1a2d7
Let's see what happens!
cheers - Barry
On 16/01/15 15:07, Ondrej Bojar wrote:
> Hi, Christian,
>
> when the scripts directory of moses was first created back in 2006, we had
> the same issues with versioning. At that point, I created the (ugly) need to
> 'install' the scripts, mainly to provide all of them with a version number.
> Fortunately, we have now got rid of this and the scripts are meant to be used
> right away after checkout.
>
> I'm saying this just to point out that there is probably no ideal way of
> keeping up to date and yet ensuring compatibility for existing models with
> toolkits as complex as moses is.
>
> For this, I use my eman, an experiment manager where even the moses toolkit
> itself is something timestamped. So I have a couple of moses checkouts,
> timestamped, and my models depend on one of them. Moving to a fresher moses
> checkout is easy (a new timestamped directory gets created), but requires
> redoing all the models (well, eman does this for me, so it's just a waste of
> computer space and time, not mine).
>
> Cheers, O.
>
> ----- Original Message -----
>> From: "Christian Hardmeier" <[email protected]>
>> To: "Hieu Hoang" <[email protected]>
>> Cc: "Tom Hoar" <[email protected]>, "moses-support support" <[email protected]>
>> Sent: Friday, 16 January, 2015 15:26:15
>> Subject: Re: [Moses-support] Legacy tokenizer.perl functionality.
>> On Jan 16, 2015, at 12:46 PM, Hieu Hoang wrote:
>>
>>> i think it's too difficult to police.
>> You'd probably need a regression test that checks if the tokenised output is
>> still the same so changes don't go unnoticed. But of course it's still some
>> extra work.
>>
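A minimal sketch of the regression test suggested above, written in Python
rather than Perl for brevity. Everything named here is an assumption made for
illustration: toy_tokenize is a hypothetical stand-in for a call to
tokenizer.perl, and the golden pairs are invented, not taken from Moses.

```python
import re
from typing import Callable, Dict, List, Tuple


def toy_tokenize(line: str) -> str:
    """Hypothetical stand-in for tokenizer.perl: split punctuation from words."""
    return " ".join(re.findall(r"\w+|[^\w\s]", line))


def find_regressions(
    tokenize: Callable[[str], str], golden: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """Return (input, expected, actual) for every input whose output changed."""
    failures = []
    for source, expected in golden.items():
        actual = tokenize(source)
        if actual != expected:
            failures.append((source, expected, actual))
    return failures


# Invented golden data: inputs paired with the output recorded when the
# models were trained.
GOLDEN = {
    "Hello, world!": "Hello , world !",
    "It works.": "It works .",
}
```

An empty result from find_regressions(toy_tokenize, GOLDEN) means the output
is unchanged; any entry in the list is a change that would otherwise go
unnoticed.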
>>> Another idea is to get the script to md5 its own source code, and the
>>> non-prefix files it uses.
>> That would definitely be better than nothing, even though it would raise
>> false alarms from time to time.
>>
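The md5 idea above could look roughly like this. It is a sketch in Python,
not the way tokenizer.perl works; the function names are hypothetical, and
which files to include (the script itself plus any data files it reads) is
left to the caller.

```python
import hashlib
from pathlib import Path
from typing import Iterable


def fingerprint_bytes(chunks: Iterable[bytes]) -> str:
    """Return the MD5 hex digest of the chunks, hashed in the order given."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def fingerprint_files(paths: Iterable[str]) -> str:
    """Fingerprint a script together with the data files it depends on."""
    # Sort the paths so the result does not depend on argument order.
    return fingerprint_bytes(Path(p).read_bytes() for p in sorted(paths))
```

A script could store fingerprint_files([its own path, ...]) next to each model
and compare it later. Because any byte change alters the digest, an edit to a
comment would also trigger a mismatch, which is the kind of false alarm
mentioned above.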
>>> On 16/01/15 11:12, Christian Hardmeier wrote:
>>>> On Jan 16, 2015, at 11:51 AM, Tom Hoar wrote:
>>>>
>>>>> I agree with versioning. Could be added to the command line.
>>>>>
>>>>> Also agree that this proposed change qualifies as a version change.
>>>>>
>>>>> How do you propose managing the issue of output changes due to
>>>>> command-line switches, like -no-escape?
>>>> Very good question. To be consistent, you'd probably have to increment the
>>>> version number even if the change only applies when you use a certain
>>>> command-line switch. But not if it doesn't affect the output, and maybe not
>>>> if you just add a new command-line switch that is off by default. What do
>>>> you think?
>>>>
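One possible way to handle both the version number and the switches is to
record them together as a signature stored with the model. The sketch below is
only an illustration of that idea in Python: TOKENIZER_VERSION, the signature
type, and the set of output-affecting switches are all assumptions, and
nothing like this exists in tokenizer.perl as far as this thread shows.

```python
from typing import FrozenSet, Iterable, NamedTuple

# Hypothetical version number, bumped whenever the tokeniser output changes.
TOKENIZER_VERSION = 2

# Hypothetical list of switches that alter the output. -no-escape is the
# switch named in the thread; listing it here is an assumption of this sketch.
OUTPUT_AFFECTING_SWITCHES = frozenset({"-no-escape"})


class TokenizerSignature(NamedTuple):
    version: int
    switches: FrozenSet[str]


def make_signature(
    switches: Iterable[str], version: int = TOKENIZER_VERSION
) -> TokenizerSignature:
    """Keep only the switches that can change the output."""
    relevant = frozenset(switches) & OUTPUT_AFFECTING_SWITCHES
    return TokenizerSignature(version, relevant)


def is_compatible(model_sig: TokenizerSignature, current_sig: TokenizerSignature) -> bool:
    """A model is compatible only if version and relevant switches both match."""
    return model_sig == current_sig
```

With this design a new switch that is off by default does not change the
signature of existing runs, while a change in behaviour under an existing
switch is covered by bumping the version.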
>>>>
>>>>
>>>>> On 01/16/2015 05:36 PM, Christian Hardmeier wrote:
>>>>>> I'd like to suggest that there should be a version number in the
>>>>>> tokeniser that is incremented whenever the output changes, even if the
>>>>>> change is minor and even if it's just a bugfix. Otherwise when you pull a
>>>>>> new version of moses you don't know if the output of tokenizer.perl is
>>>>>> still compatible with your existing models. (Moving functionality from
>>>>>> tokenizer.perl to normalize-punctuation.perl would count as a change from
>>>>>> my point of view. I don't always use normalize-punctuation.)
>>>>>>
>>>>>> /Christian
>>>>>>
>>>>>> On Jan 16, 2015, at 10:36 AM, Hieu Hoang wrote:
>>>>>>
>>>>>>> it's probably a good idea to make this change. If you've done it
>>>>>>> already, please send me the updated scripts and I'll check it in. If
>>>>>>> not, I'll do it myself
>>>>>>>
>>>>>>> there's hopefully a fast, C++ tokenizer replacement coming soon.
>>>>>>> Highlighting these issues now is useful for understanding exactly how
>>>>>>> the tokenizer works/should work
>>>>>>>
>>>>>>> On 15/01/15 01:52, Tom Hoar wrote:
>>>>>>>> This is a separate issue from the parallel "Tokenization problem"
>>>>>>>> thread...
>>>>>>>>
>>>>>>>> The tokenizer.perl has had one line that transforms the grave accent (`)
>>>>>>>> to apostrophe and another that transforms double apostrophe ('') to
>>>>>>>> single quote. I suspect these have been in the script since the
>>>>>>>> beginning. However, they recently "bit" me on a recent project. Easy
>>>>>>>> enough to work around.
>>>>>>>>
>>>>>>>> Still, I'm wondering. Do they still belong in the tokenizer.perl script?
>>>>>>>> Or, should they be moved into one of the other scripts? The
>>>>>>>> normalize-punctuation.perl script seems to be a good candidate.
>>>>>>>> _______________________________________________
>>>>>>>> Moses-support mailing list
>>>>>>>> [email protected] <mailto:[email protected]>
>>>>>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>>>>>>
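To make the two rewrites described above concrete, here is a small Python
sketch of them as a standalone normalisation step, separate from tokenisation.
It is not copied from tokenizer.perl: the function name is hypothetical, and
reading "single quote" as one double-quote character (") is an assumption.

```python
def normalize_quotes(text: str) -> str:
    """Apply the two quote rewrites as a separate normalisation step."""
    # Grave accent becomes an apostrophe.
    text = text.replace("`", "'")
    # Two consecutive apostrophes become one double-quote character
    # (assumed target; see the note above).
    text = text.replace("''", '"')
    return text
```

Because the grave accent is rewritten first, a pair of grave accents also ends
up as a double-quote character. Text that uses these characters literally,
such as source code, would be altered, which may be how the rules "bite".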
--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
--
Hieu Hoang
Research Associate
University of Edinburgh
http://www.hoang.co.uk/hieu
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support