On 16 January 2015 at 16:13, Barry Haddow <[email protected]> wrote:
> Hi
>
> Yes, the EMS (experiment management system) included with Moses will
> also deal with this by checking timestamps on the tokeniser scripts.
>
> If you use your models outside the EMS (or Eman etc) however then
> there's no easy way to ensure compatibility between tokeniser and model.
> I agree that the tokeniser shouldn't be doing text normalisation, but it
> was, and fixing it could cause more pain than leaving things as they are,
>

oh, i just moved the normalisation to normalize-punctuation.perl. It was a pain
https://github.com/moses-smt/mosesdecoder/commit/19d7c44aad1d1b06b884833cc3a7b0e14d2a6c36
https://github.com/moses-smt/mosesdecoder/commit/30e31d4a95713cd5340941b15d566f09b3b1a2d7
Let's see what happens!

> cheers - Barry
>
> On 16/01/15 15:07, Ondrej Bojar wrote:
> > Hi, Christian,
> >
> > when the scripts directory of moses was first created back in 2006, we
> > had the same issues with versioning. At that point, I created the (ugly)
> > need to 'install' the scripts, mainly to provide all of them with a
> > version number. Fortunately, we have now got rid of this and the scripts
> > are meant to be used right away after checkout.
> >
> > I'm saying this just to point out that there is probably no ideal way of
> > keeping up to date and yet ensuring compatibility for existing models
> > with toolkits as complex as moses is.
> >
> > For this, I use my eman, an experiment manager where even moses toolkit
> > itself is something timestamped. So I have a couple of moses checkouts,
> > timestamped, and my models depend on one of them. Moving to a fresher
> > moses checkout is easy (a new timestamped directory gets created), but
> > requires to redo all the models (well, eman does this for me, so it's
> > just a waste of computer space and time, not mine).
> >
> > Cheers, O.
> >
> > ----- Original Message -----
> >> From: "Christian Hardmeier" <[email protected]>
> >> To: "Hieu Hoang" <[email protected]>
> >> Cc: "Tom Hoar" <[email protected]>, "moses-support support" <[email protected]>
> >> Sent: Friday, 16 January, 2015 15:26:15
> >> Subject: Re: [Moses-support] Legacy tokenizer.perl functionality.
> >> On Jan 16, 2015, at 12:46 PM, Hieu Hoang wrote:
> >>
> >>> i think it's too difficult to police.
> >> You'd probably need a regression test that checks if the tokenised
> >> output is still the same so changes don't go unnoticed. But of course
> >> it's still some extra work.
> >>
> >>> Another idea is to get the script to md5 its own source code, and the
> >>> non-prefix files it uses.
> >> That would definitely be better than nothing, even though it would
> >> raise false alarms from time to time.
> >>
> >>> On 16/01/15 11:12, Christian Hardmeier wrote:
> >>>> On Jan 16, 2015, at 11:51 AM, Tom Hoar wrote:
> >>>>
> >>>>> I agree with versioning. Could be added to the command line.
> >>>>>
> >>>>> Also agree that this proposed change qualifies as a version change.
> >>>>>
> >>>>> How do you propose managing the issue of output changes due to
> >>>>> command-line switches, like -no-escape?
> >>>> Very good question. To be consistent, you'd probably have to
> >>>> increment the version number even if the change only applies when
> >>>> you use a certain command-line switch. But not if it doesn't affect
> >>>> the input, and maybe not if you just add a new command-line switch
> >>>> that is off by default. What do you think?
> >>>>
> >>>>> On 01/16/2015 05:36 PM, Christian Hardmeier wrote:
> >>>>>> I'd like to suggest that there should be a version number in the
> >>>>>> tokeniser that is incremented whenever the output changes, even if
> >>>>>> the change is minor and even if it's just a bugfix. Otherwise when
> >>>>>> you pull a new version of moses you don't know if the output of
> >>>>>> tokenizer.perl is still compatible with your existing models.
> >>>>>> (Moving functionality from tokenizer.perl to
> >>>>>> normalize-punctuation.perl would count as a change from my point
> >>>>>> of view. I don't always use normalize-punctuation.)
> >>>>>>
> >>>>>> /Christian
> >>>>>>
> >>>>>> On Jan 16, 2015, at 10:36 AM, Hieu Hoang wrote:
> >>>>>>
> >>>>>>> it's probably a good idea to make this change. If you've done it
> >>>>>>> already, please send me the updated scripts and I'll check it in.
> >>>>>>> If not, I'll do it myself
> >>>>>>>
> >>>>>>> there's hopefully a fast, C++ tokenizer replacement coming soon.
> >>>>>>> Highlighting these issues now is useful to understanding exactly
> >>>>>>> how the tokenizer works/should work
> >>>>>>>
> >>>>>>> On 15/01/15 01:52, Tom Hoar wrote:
> >>>>>>>> This is a separate issue from the parallel "Tokenization
> >>>>>>>> problem" thread...
> >>>>>>>>
> >>>>>>>> The tokenizer.perl has had one line that transforms the grave
> >>>>>>>> accent (`) to apostrophe and another that transforms double
> >>>>>>>> apostrophe ('') to single quote. I suspect these have been in
> >>>>>>>> the script since the beginning. However, they recently "bit" me
> >>>>>>>> on a recent project. Easy enough to work around.
> >>>>>>>>
> >>>>>>>> Still, I'm wondering. Do they still belong in the tokenizer.perl
> >>>>>>>> script? Or, should they be moved into one of the other scripts?
> >>>>>>>> The normalize-punctuation.perl script seems to be a good
> >>>>>>>> candidate.
--
Hieu Hoang
Research Associate
University of Edinburgh
http://www.hoang.co.uk/hieu
_______________________________________________ Moses-support mailing list [email protected] http://mailman.mit.edu/mailman/listinfo/moses-support
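For anyone who wants to try the "md5 its own source code" idea from the thread outside the EMS, here is a minimal sketch in Python rather than Perl. It is not what tokenizer.perl does, and the function names fingerprint and is_compatible are made up for illustration. The assumption is that you record one digest over the script and the data files it reads when you train a model, then compare it again before reusing that model.

```python
import hashlib
from pathlib import Path
from typing import Iterable


def fingerprint(paths: Iterable[Path]) -> str:
    """Return one MD5 hex digest covering the contents of all given files."""
    digest = hashlib.md5()
    # Sort so the result does not depend on the order the files are listed in.
    for path in sorted(Path(p) for p in paths):
        data = path.read_bytes()
        # Mix in the file name and length, each followed by a separator,
        # so that moving bytes from one file to another changes the digest.
        digest.update(path.name.encode("utf-8") + b"\0")
        digest.update(str(len(data)).encode("ascii") + b"\0")
        digest.update(data)
    return digest.hexdigest()


def is_compatible(paths: Iterable[Path], recorded: str) -> bool:
    """True if the files still match the digest recorded at training time."""
    return fingerprint(paths) == recorded
```

You would call fingerprint() on the tokeniser script plus whatever data files it loads, store the result next to the model, and call is_compatible() with the same list later. As Christian points out, this raises false alarms: a comment-only edit changes the digest even though the tokenised output stays the same.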
