On Jan 16, 2015, at 12:46 PM, Hieu Hoang wrote:
> I think it's too difficult to police.
You'd probably need a regression test that checks if the tokenised output is
still the same so changes don't go unnoticed. But of course it's still some
extra work.
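
A minimal sketch of such a check, in Python for brevity. The function names,
the sample sentences and the idea of keeping a stored reference output are my
own assumptions for illustration, not something that exists in the Moses
repository. The tokeniser is passed in as a callable, so the same comparison
works for tokenizer.perl (through the small subprocess wrapper) or for any
replacement.

```python
import subprocess
from itertools import zip_longest


def run_tokenizer(cmd, lines):
    """Pipe lines through an external tokeniser command and return its output lines.

    cmd is an argument list such as ["perl", "tokenizer.perl", "-l", "en"];
    the script location is an assumption and depends on your checkout.
    """
    result = subprocess.run(
        cmd,
        input="\n".join(lines) + "\n",
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()


def find_regressions(input_lines, reference_lines, tokenize):
    """Return (line_number, expected, actual) for every line whose output changed.

    reference_lines is the tokenised output saved from a known-good version.
    A missing line on either side is reported with None in its place.
    """
    actual_lines = tokenize(input_lines)
    regressions = []
    pairs = zip_longest(reference_lines, actual_lines)
    for number, (expected, actual) in enumerate(pairs, start=1):
        if expected != actual:
            regressions.append((number, expected, actual))
    return regressions
```

An empty list means the output is unchanged. Anything else is either a bug or
a reason to bump the version number and regenerate the reference.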
> Another idea is to get the script to md5 its own source code, and the
> non-prefix files it uses.
That would definitely be better than nothing, even though it would raise false
alarms from time to time.
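
For illustration, the fingerprint idea could look roughly like this. It is
written in Python rather than Perl, and the function name and the choice to
hash file names along with file contents are my assumptions, not part of any
existing script.

```python
import hashlib
import os


def source_fingerprint(paths):
    """Return one MD5 hex digest covering the given source files.

    Files are processed in sorted order so the result does not depend on the
    order in which the caller lists them. The base name of each file is mixed
    in so that renaming a file also changes the fingerprint.
    """
    digest = hashlib.md5()
    for path in sorted(paths):
        digest.update(os.path.basename(path).encode("utf-8"))
        with open(path, "rb") as handle:
            digest.update(handle.read())
    return digest.hexdigest()
```

Any byte-level change alters the digest, including an edited comment or
whitespace, which is where the false alarms come from.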
>
> On 16/01/15 11:12, Christian Hardmeier wrote:
>> On Jan 16, 2015, at 11:51 AM, Tom Hoar wrote:
>>
>>> I agree with versioning. Could be added to the command line.
>>>
>>> Also agree that this proposed change qualifies as a version change.
>>>
>>> How do you propose managing the issue of output changes due to
>>> command-line switches, like -no-escape?
>> Very good question. To be consistent, you'd probably have to increment the
>> version number even if the change only applies when you use a certain
>> command-line switch. But not if it doesn't affect the output, and maybe not
>> if you just add a new command-line switch that is off by default. What do
>> you think?
>>
>>
>>
>>>
>>> On 01/16/2015 05:36 PM, Christian Hardmeier wrote:
>>>> I'd like to suggest that there should be a version number in the tokeniser
>>>> that is incremented whenever the output changes, even if the change is
>>>> minor and even if it's just a bugfix. Otherwise when you pull a new
>>>> version of moses you don't know if the output of tokenizer.perl is still
>>>> compatible with your existing models. (Moving functionality from
>>>> tokenizer.perl to normalize-punctuation.perl would count as a change from
>>>> my point of view. I don't always use normalize-punctuation.perl.)
>>>>
>>>> /Christian
>>>>
>>>> On Jan 16, 2015, at 10:36 AM, Hieu Hoang wrote:
>>>>
>>>>> It's probably a good idea to make this change. If you've done it
>>>>> already, please send me the updated scripts and I'll check them in. If
>>>>> not, I'll do it myself.
>>>>>
>>>>> There's hopefully a fast C++ tokenizer replacement coming soon.
>>>>> Highlighting these issues now is useful for understanding exactly how the
>>>>> tokenizer works/should work.
>>>>>
>>>>> On 15/01/15 01:52, Tom Hoar wrote:
>>>>>> This is a separate issue from the parallel "Tokenization problem"
>>>>>> thread...
>>>>>>
>>>>>> The tokenizer.perl script has one line that transforms the grave accent (`)
>>>>>> to an apostrophe and another that transforms a double apostrophe ('') to a
>>>>>> single quote. I suspect these have been in the script since the
>>>>>> beginning. However, they "bit" me on a recent project. Easy
>>>>>> enough to work around.
>>>>>>
>>>>>> Still, I'm wondering: do they still belong in the tokenizer.perl script?
>>>>>> Or should they be moved into one of the other scripts? The
>>>>>> normalize-punctuation.perl script seems to be a good candidate.
>>>>>> _______________________________________________
>>>>>> Moses-support mailing list
>>>>>> [email protected]
>>>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>>>>