Hi Lefty,

For the -protect option, the format is one regular expression per line. For example, if you use a file with one line like this:

http://\S+

then it should protect some URLs from tokenisation. It works for me. If you have problems then send me the file.

For the -a option, I think the detokeniser should put the hyphens back together again, but I have not checked.

cheers - Barry

On 14/10/13 19:22, Eleftherios Avramidis wrote:
> Hi,
>
> I see tokenizer.perl now offers an option for excluding URLs and other
> expressions: "-protect FILE ... specify file with patterns to be
> protected in tokenisation." Unfortunately there is no explanation of
> what format this file should have. I tried several ways of writing
> regular expressions for URLs, but URLs still come out tokenised. Could
> you provide an example?
>
> My second question concerns the -a option, for aggressive hyphen
> splitting. Does the detokenizer offer a similar option, to reconstruct
> separated hyphens?
>
> cheers
> Lefteris

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
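[Editor's note: to illustrate the one-regex-per-line protect format Barry describes, here is a small Python sketch. It is NOT the Moses tokenizer.perl implementation — it only shows which substrings a pattern like http://\S+ would match and therefore shield from tokenisation; the input sentence and function name are made up for the demo.]

```python
import re

# Each entry corresponds to one line of a -protect file,
# e.g. a file containing the single line: http://\S+
protect_patterns = ["http://\\S+"]

def find_protected(text, patterns):
    """Return the substrings that the protect patterns would match
    (and that the tokenizer would then leave intact)."""
    spans = []
    for pat in patterns:
        for m in re.finditer(pat, text):
            spans.append(m.group(0))
    return spans

print(find_protected("See http://example.com/a?b=1 for details.",
                     protect_patterns))
# -> ['http://example.com/a?b=1']
```

Note that \S+ runs up to the next whitespace, so the whole URL, including the query string, stays in one piece instead of being split at the punctuation.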
