Hi Tom
The implementation of 'protected' segments was fairly quick and simple,
so there are some restrictions on the patterns it can handle; in
particular, you'd at least have to turn the groupings in the
expressions below into non-capturing groupings.
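For instance, a capturing group like (:\d{1,5})? in the first
expression would need to become (?::\d{1,5})? before the pattern could
go in a protect file, and likewise for the other parenthesised groups.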
The protected segments were sufficient for my purposes, but if anyone
wants to improve them, feel free...
cheers - Barry
On 15/10/13 02:47, Tom Hoar wrote:
> Thanks! This is another handy new feature. I suggest that the
> "placeholders" functionality (separate thread) combined with this
> "protect" option could be a killer combination. Escape URLs with a
> token, for example @URL@, before tokenization. Then protect this token
> during tokenization. You won't have to "fix" it afterwards, and you
> can define alternate URL translations at Moses runtime
> (example.com => example.ca).
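>
> Roughly what I have in mind, as an untested sketch (the crude URL
> pattern and the file name urls.txt are just placeholders):
>
>   # escape-urls.pl: replace each URL with @URL@ before tokenization,
>   # saving the originals so a later step can restore them after
>   # detokenization
>   use strict;
>   use warnings;
>
>   my @urls;
>   while (my $line = <STDIN>) {
>       # crude URL pattern, for illustration only
>       $line =~ s{(https?://\S+)}{ push @urls, $1; '@URL@' }ge;
>       print $line;
>   }
>
>   # side file consumed by the restore step
>   open my $fh, '>', 'urls.txt' or die "urls.txt: $!";
>   print {$fh} "$_\n" for @urls;
>
> The protect file would then contain just the single line @URL@.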
>
> BTW, here's a more focused regular expression we use to identify URLs.
>
> (?i)\b((?:(?:(?:[a-z32][\w-]{1,6}:{1}/{2,3})[a-z0-9.\-_]+(:\d{1,5})?(/?))([^\s<>',\?\.]*([\.][a-z]{2,4})?)*(?:\?[^\s<>',\.]+)?))
>
> Here's another that works nicely. We found it at:
> http://daringfireball.net/2010/07/improved_regex_for_matching_urls
>
> (?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
>
> On 10/15/2013 02:38 AM, Barry Haddow wrote:
>> Hi Lefty
>>
>> For the 'protect' option, the format is one regular expression per line.
>> For example, if you use a file with one line like this:
>>
>> http://\S+
>>
>> then it should protect some URLs from tokenisation. It works for me. If
>> you have problems then send me the file.
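>>
>> For example, something like this (path and file name invented for
>> illustration; -l en is the tokenizer's language option):
>>
>>   echo "see http://example.com/foo for details" \
>>     | scripts/tokenizer/tokenizer.perl -l en -protect my-patterns.txt
>>
>> should print the sentence with the URL left in one piece, assuming
>> my-patterns.txt contains the http://\S+ line above.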
>>
>> For the -a option, I think the detokeniser should put the hyphens back
>> together again, but I have not checked.
>>
>> cheers - Barry
>>
>> On 14/10/13 19:22, Eleftherios Avramidis wrote:
>>> Hi,
>>>
>>> I see tokenizer.perl now offers an option for excluding URLs and other
>>> expressions: "-protect FILE ... specify file with patterns to be
>>> protected in tokenisation." Unfortunately there is no explanation of
>>> how this optional file should be formatted. I tried several ways of
>>> writing regular expressions for URLs, but URLs still come out
>>> tokenized. Could you provide an example?
>>>
>>> My second question concerns the -a option for aggressive hyphen
>>> splitting. Does the detokenizer offer a similar option to reconstruct
>>> separated hyphens?
>>>
>>> cheers
>>> Lefteris
>>>
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support