Hi Lefty

Thanks for pointing that out - I fixed it,
cheers - Barry

On 16/10/13 14:09, Eleftherios Avramidis wrote:
> Hi Barry,
>
> I found a typo/bug that explains why it hasn't worked so far here: the
> help message of tokenizer.perl said that the parameter is "-protect",
> but in fact it is "-protected".
>
> best
> Lefteris
>
> On 14/10/13 21:38, Barry Haddow wrote:
>> Hi Lefty
>>
>> For the 'protect' option, the format is one regular expression per
>> line. For example, if you use a file with one line like this:
>>
>> http://\S+
>>
>> then it should protect some URLs from tokenisation. It works for me.
>> If you have problems then send me the file.
>>
>> For the -a option, I think the detokeniser should put the hyphens
>> back together again, but I have not checked.
>>
>> cheers - Barry
>>
>> On 14/10/13 19:22, Eleftherios Avramidis wrote:
>>> Hi,
>>>
>>> I see tokenizer.perl now offers an option for excluding URLs and
>>> other expressions: "-protect FILE ... specify file with patterns to
>>> be protected in tokenisation." Unfortunately there is no explanation
>>> of how this optional file should look. I tried several ways of
>>> writing regular expressions for URLs, but URLs still come out
>>> tokenized. Could you provide an example?
>>>
>>> My second question concerns the -a option, for aggressive hyphen
>>> splitting. Does the detokenizer offer a similar option, to
>>> reconstruct separated hyphens?
>>>
>>> cheers
>>> Lefteris

--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
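
For reference, a minimal sketch of the usage discussed in the thread above.
It assumes the usual stdin/stdout invocation of the Moses tokenizer; the file
name url-patterns.txt and the sample input/output paths are only placeholders,
and the flag is -protected (not -protect), as noted above:

    # url-patterns.txt: one Perl regular expression per line;
    # this single pattern protects http URLs
    http://\S+

    # tokenise English text; spans matching a protected pattern
    # (here, URLs) are left intact as single tokens
    perl tokenizer.perl -l en -protected url-patterns.txt < input.txt > output.tok

If URLs still come out split, it is worth checking that the pattern actually
matches the URLs in the input (for instance, the pattern above does not cover
https:// links).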
