Hi Barry,

Thanks, both the tokenizer and the detokenizer work as you said. Problem solved.
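For anyone who finds this thread in the archives: the protect file takes one regular expression per line, and matching spans come through tokenisation untouched. My rough mental model of the mechanism is a placeholder round trip, sketched below in Python (the placeholder scheme and the toy tokeniser are my own illustration, not necessarily what tokenizer.perl does internally):

```python
import re

def tokenize_with_protect(text, patterns):
    """Toy tokeniser with protected patterns (illustration only)."""
    protected = []
    # Shield every match of a protect pattern behind a placeholder.
    for pat in patterns:
        for match in re.findall(pat, text):
            placeholder = f"PROTECTED{len(protected):03d}"
            text = text.replace(match, placeholder, 1)
            protected.append((placeholder, match))
    # Naive tokenisation: split punctuation off words.
    text = re.sub(r"([.,!?()])", r" \1 ", text)
    # Swap the placeholders back for the original spans.
    mapping = dict(protected)
    return [mapping.get(tok, tok) for tok in text.split()]

print(tokenize_with_protect(
    "See http://www.statmt.org/moses?a=1 for details.",
    [r"http://\S+"]))
# ['See', 'http://www.statmt.org/moses?a=1', 'for', 'details', '.']
```

Without the protect pattern, the `?`, `=` and `.` inside the URL would risk being split off like ordinary punctuation.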

best
Lefteris
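P.S. On the -a question, for the record: as I understand it, aggressive hyphen splitting marks hyphens with the @-@ token (e.g. "multi-word" becomes "multi @-@ word"), which the detokenizer can then rejoin. A rough sketch of the round trip (my own approximation in Python, not the actual Perl code):

```python
import re

def split_hyphens(text):
    # Aggressive hyphen splitting: replace a hyphen between word
    # characters with the @-@ marker token.
    return re.sub(r"(\w)-(\w)", r"\1 @-@ \2", text)

def rejoin_hyphens(text):
    # Detokenisation step: collapse the @-@ marker back into a hyphen.
    return text.replace(" @-@ ", "-")

s = split_hyphens("state-of-the-art")
print(s)                  # state @-@ of @-@ the @-@ art
print(rejoin_hyphens(s))  # state-of-the-art
```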


On 14/10/13 21:38, Barry Haddow wrote:
> Hi Lefty
>
> For the 'protect' option, the format is one regular expression per 
> line. For example if you use a file with one line like this:
>
> http://\S+
>
> then it should protect some URLs from tokenisation. It works for me. 
> If you have problems then send me the file.
>
> For the -a option, I think the detokeniser should put the hyphens back 
> together again, but I have not checked.
>
> cheers - Barry
>
> On 14/10/13 19:22, Eleftherios Avramidis wrote:
>> Hi,
>>
>> I see tokenizer.perl now offers an option for excluding URLs and other
>> expressions: "  -protect FILE  ... specify file with patterns to be
>> protected in tokenisation." Unfortunately there is no explanation of
>> what format this file should have. I tried several ways of writing
>> regular expressions for URLs, but URLs still come out tokenized. Could
>> you provide an example?
>>
>> My second question concerns the -a option for aggressive hyphen
>> splitting. Does the detokenizer offer a similar option, to reconstruct
>> separated hyphens?
>>
>> cheers
>> Lefteris
>>
>


-- 
MSc. Inf. Eleftherios Avramidis
DFKI GmbH, Alt-Moabit 91c, 10559 Berlin
Tel. +49-30 238 95-1806

Fax. +49-30 238 95-1810

-------------------------------------------------------------------------------------------
Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern

Geschaeftsfuehrung:
Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
Dr. Walter Olthoff

Vorsitzender des Aufsichtsrats:
Prof. Dr. h.c. Hans A. Aukes

Amtsgericht Kaiserslautern, HRB 2313
-------------------------------------------------------------------------------------------

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
