Re: [Moses-support] Legacy tokenizer.perl functionality.

Ondrej Bojar Fri, 16 Jan 2015 07:10:18 -0800

Hi, Christian,

when the scripts directory of moses was first created back in 2006, we had the 
same issues with versioning. At that point, I created the (ugly) need to 'install' 
the scripts, mainly to provide all of them with a version number. Fortunately, 
we have now got rid of this and the scripts are meant to be used right away after 
checkout.


I'm saying this just to point out that there is probably no ideal way of 
keeping up to date and yet ensuring compatibility for existing models with 
toolkits as complex as moses is.

For this, I use my eman, an experiment manager where even the moses toolkit itself 
is something timestamped. So I have a couple of moses checkouts, timestamped, 
and my models depend on one of them. Moving to a fresher moses checkout is easy 
(a new timestamped directory gets created), but requires redoing all the models 
(well, eman does this for me, so it's just a waste of computer space and time, 
not mine).
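
Without an experiment manager, a lighter variant of the md5 idea quoted below 
could look roughly like this. It is only a sketch in Python: the function names 
and the stamp file are made up for illustration, and nothing in this thread says 
such a check exists in the moses scripts. It hashes the tokenizer script plus 
whatever data files you pass in, stores the digest next to a model at training 
time, and compares it again before the model is reused.

```python
import hashlib
from pathlib import Path


def fingerprint(paths):
    """Return one MD5 hex digest covering the contents of all given files."""
    digest = hashlib.md5()
    # Sort so the result does not depend on the order the caller lists files in.
    for path in sorted(Path(p) for p in paths):
        data = path.read_bytes()
        # Mix in the file name and length so that moving bytes from one
        # file to another still changes the digest.
        digest.update(path.name.encode("utf-8") + b"\0")
        digest.update(str(len(data)).encode("ascii") + b"\0")
        digest.update(data)
    return digest.hexdigest()


def save_fingerprint(paths, stamp_file):
    """Record the current fingerprint, e.g. right after training a model."""
    Path(stamp_file).write_text(fingerprint(paths) + "\n")


def is_compatible(paths, stamp_file):
    """True if the files still match the fingerprint stored in stamp_file."""
    return Path(stamp_file).read_text().strip() == fingerprint(paths)
```

As noted in the thread, a check like this raises false alarms: a change to a 
comment in the script alters the digest even though the tokenised output stays 
the same.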

Cheers, O.

----- Original Message -----
> From: "Christian Hardmeier" <[email protected]>
> To: "Hieu Hoang" <[email protected]>
> Cc: "Tom Hoar" <[email protected]>, "moses-support 
> support" <[email protected]>
> Sent: Friday, 16 January, 2015 15:26:15
> Subject: Re: [Moses-support] Legacy tokenizer.perl functionality.

> On Jan 16, 2015, at 12:46 PM, Hieu Hoang wrote:
> 
>> i think it's too difficult to police.
> 
> You'd probably need a regression test that checks if the tokenised output is
> still the same so changes don't go unnoticed. But of course it's still some
> extra work.
> 
>> Another idea is to get the script to md5 its own source code, and the 
>> non-prefix
>> files it uses.
> 
> That would definitely be better than nothing, even though it would raise false
> alarms from time to time.
> 
>> 
>> On 16/01/15 11:12, Christian Hardmeier wrote:
>>> On Jan 16, 2015, at 11:51 AM, Tom Hoar wrote:
>>> 
>>>> I agree with versioning. Could be added to the command line.
>>>> 
>>>> Also agree that this proposed change qualifies as a version change.
>>>> 
>>>> How do you propose managing the issue of output changes due to
>>>> command-line switches, like -no-escape?
>>> Very good question. To be consistent, you'd probably have to increment the
>>> version number even if the change only applies when you use a certain
>>> command-line switch. But not if it doesn't affect the output, and maybe not 
>>> if
>>> you just add a new command-line switch that is off by default. What do you
>>> think?
>>> 
>>> 
>>> 
>>>> 
>>>> On 01/16/2015 05:36 PM, Christian Hardmeier wrote:
>>>>> I'd like to suggest that there should be a version number in the 
>>>>> tokeniser that
>>>>> is incremented whenever the output changes, even if the change is minor 
>>>>> and
>>>>> even if it's just a bugfix. Otherwise when you pull a new version of 
>>>>> moses you
>>>>> don't know if the output of tokenizer.perl is still compatible with your
>>>>> existing models. (Moving functionality from tokenizer.perl to
>>>>> normalize-punctuation.perl would count as a change from my point of view. 
>>>>> I
>>>>> don't always use normalize-punctuation.perl.)
>>>>> 
>>>>> /Christian
>>>>> 
>>>>> On Jan 16, 2015, at 10:36 AM, Hieu Hoang wrote:
>>>>> 
>>>>>> it's probably a good idea to make this change. If you've done it
>>>>>> already, please send me the updated scripts and I'll check it in. If
>>>>>> not, I'll do it myself
>>>>>> 
>>>>>> there's hopefully a fast, C++ tokenizer replacement coming soon.
>>>>>> Highlighting these issues now is useful to understanding exactly how the
>>>>>> tokenizer works/should work
>>>>>> 
>>>>>> On 15/01/15 01:52, Tom Hoar wrote:
>>>>>>> This is a separate issue from the parallel "Tokenization problem" 
>>>>>>> thread...
>>>>>>> 
>>>>>>> The tokenizer.perl has had one line that transforms the grave accent (`)
>>>>>>> to an apostrophe and another that transforms a double apostrophe ('') to a
>>>>>>> single quote. I suspect these have been in the script since the
>>>>>>> beginning. However, they "bit" me on a recent project. Easy
>>>>>>> enough to work around.
>>>>>>> 
>>>>>>> Still, I'm wondering. Do they still belong in the tokenizer.perl script?
>>>>>>> Or, should they be moved into one of the other scripts? The
>>>>>>> normalize-punctuation.perl script seems to be a good candidate.
>>>>>>> _______________________________________________
>>>>>>> Moses-support mailing list
>>>>>>> [email protected]
>>>>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>>>>> 
>>> 
>>> 
>> 
> 
> 

-- 
Ondrej Bojar (mailto:[email protected] / [email protected])
http://www.cuni.cz/~obo
