Thanks! This is another handy new feature. I think the "placeholders" 
functionality (separate thread) and this "protect" option could be a 
killer combination. Escape URLs with a token, for example @URL@, before 
tokenization. Then, protect this token during tokenization. You won't 
have to "fix" it afterwards, and you can define alternate URL 
translations during Moses runtime (example.com => example.ca).
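
To make the idea concrete, here is a rough Python sketch of the escape/restore steps around the tokenizer. It is not the Moses placeholders feature itself, just a standalone illustration. The function names (escape_urls, restore_urls), the simple URL pattern, and the assumption that placeholders come back in the same order they went in are all mine. Since the protect file takes one regular expression per line, a file containing the single line @URL@ should presumably be enough to keep the token intact, though I have not verified that.

```python
import re

# Deliberately simple URL pattern, for illustration only (an assumption).
URL_RE = re.compile(r"https?://\S+")
PLACEHOLDER = "@URL@"  # example token from the message above


def escape_urls(text):
    """Replace each URL with PLACEHOLDER; return (escaped_text, urls)."""
    urls = []

    def _swap(match):
        urls.append(match.group(0))
        return PLACEHOLDER

    return URL_RE.sub(_swap, text), urls


def restore_urls(text, urls, rewrite=None):
    """Put URLs back in their original order, optionally rewriting each one.

    Assumes the placeholders appear in the same order as the saved URLs.
    """
    pending = iter(urls)

    def _swap(match):
        url = next(pending)
        return rewrite(url) if rewrite else url

    return re.sub(re.escape(PLACEHOLDER), _swap, text)


escaped, urls = escape_urls("See http://example.com/docs for details.")
# ... tokenize and translate `escaped` here, with @URL@ protected ...
restored = restore_urls(
    escaped, urls, lambda u: u.replace("example.com", "example.ca")
)
print(escaped)
print(restored)
```

The rewrite callback is where an alternate URL mapping such as example.com => example.ca could be applied in this sketch.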

BTW, here's a more focused regular expression we use to identify URLs.

(?i)\b((?:(?:(?:[a-z32][\w-]{1,6}:{1}/{2,3})[a-z0-9.\-_]+(:\d{1,5})?(/?))([^\s<>',\?\.]*([\.][a-z]{2,4})?)*(?:\?[^\s<>',\.]+)?))

Here's another that works nicely. We found it at: 
http://daringfireball.net/2010/07/improved_regex_for_matching_urls

(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
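
If you want to try this second expression outside of Perl, the following small Python sketch shows one way to do it. The pattern is copied unchanged from above; the helper name find_urls and the sample sentence are only illustrations.

```python
import re

# The expression above, unchanged. A raw triple-quoted string is used
# because the pattern contains both single and double quote characters.
URL_RE = re.compile(
    r"""(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))"""
)


def find_urls(text):
    """Return every URL-like substring matched by the pattern (group 1)."""
    return [m.group(1) for m in URL_RE.finditer(text)]


sample = "Docs are at http://example.com/a/b.html, mirrored on www.example.org."
print(find_urls(sample))
# → ['http://example.com/a/b.html', 'www.example.org']
```

Note that the trailing comma and the final period are left out of the matches, which is what you want when URLs appear in running text.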




On 10/15/2013 02:38 AM, Barry Haddow wrote:
> Hi Lefty
>
> For the 'protect' option, the format is one regular expression per line.
> For example if you use a file with one line like this:
>
> http://\S+
>
> then it should protect some URLs from tokenisation. It works for me. If
> you have problems then send me the file.
>
> For the -a option, I think the detokeniser should put the hyphens back
> together again, but I have not checked.
>
> cheers - Barry
>
> On 14/10/13 19:22, Eleftherios Avramidis wrote:
>> Hi,
>>
>> I see tokenizer.perl now offers an option for excluding URLs and other
>> expressions. "  -protect FILE  ... specify file with patters to be
>> protected in tokenisation." Unfortunately there is no explanation of how
>> this optional file should be formatted. I tried several ways of writing regular
>> expressions for URLs, but URLs still come out tokenized. Could you
>> provide an example?
>>
>> My second question concerns the -a option, for aggressive hyphen
>> splitting. Does the detokenizer offer a similar option, to reconstruct
>> separated hyphens?
>>
>> cheers
>> Lefteris
>>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support

