Re: [Moses-support] Detokenizer

Barry Haddow Tue, 15 Jul 2014 01:06:07 -0700

Hi Judah

The actual problem here is that you do not want path names split by the 
tokeniser. It's only really set up to deal with regular text, but what 
you can do is ask it to "protect" certain patterns by using the


-protected <filename>

argument. The file <filename> should contain a list of regular 
expressions (one per line), and the tokeniser will not split apart any 
tokens which match these REs. I'm guessing that in the example below you 
don't want "tutorial" translated into the target language, and if the 
tokeniser doesn't split the path then the whole thing will pass through 
as an OOV (out-of-vocabulary token) and be copied to the output unchanged.
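
To make the idea concrete, here is a rough Python sketch of how protecting 
patterns can work: matches are swapped for placeholders, the text is 
tokenised, and the originals are put back. This is not the tokeniser's 
actual code. The two patterns, the placeholder scheme and the function 
names are all made up for illustration, and the patterns you put in 
<filename> would need adapting to your own data.

```python
import re

# Hypothetical patterns, written one per line as they might appear in the
# protected-patterns file. They are illustrative assumptions, not defaults.
PROTECTED_PATTERNS = [
    r":\w+:`[^`]+`",   # a whole role such as :doc:`/some/path`
    r"(?:/[\w.-]+)+",  # a bare slash-separated path
]


def naive_tokenize(text):
    """Stand-in tokeniser: every punctuation character becomes its own token."""
    return re.findall(r"\w+|[^\w\s]", text)


def tokenize_with_protection(text, patterns=PROTECTED_PATTERNS):
    """Tokenise text while keeping any span that matches a pattern intact."""
    protected = []

    def stash(match):
        # Remember the matched span and leave a placeholder the tokeniser
        # will treat as a single word.
        protected.append(match.group(0))
        return f" PROTECTED{len(protected) - 1}X "

    for pattern in patterns:
        text = re.sub(pattern, stash, text)

    tokens = naive_tokenize(text)

    # Swap each placeholder back for the span it stands for.
    return [
        protected[int(tok[9:-1])] if re.fullmatch(r"PROTECTED\d+X", tok) else tok
        for tok in tokens
    ]


if __name__ == "__main__":
    line = "See :doc:`/tutorial/aggregation-zip-code-data-set` now"
    print(naive_tokenize(line))
    print(tokenize_with_protection(line))
```

Without protection the stand-in tokeniser breaks the path at every slash, 
colon and hyphen; with it, the whole role survives as a single token.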

cheers - Barry

On 14/07/14 16:53, Judah Schvimer wrote:
> Hi,
>
> When I'm using the decoder I have to tokenize my target sentences 
> before I translate them. However, when I detokenize them it leaves 
> awkward spaces around what was tokenized. Is there any way to fix 
> this? It seems to be mainly around slashes and colons.
>
> Source: :doc:`/tutorial/aggregation-zip-code-data-set`
> Target: : Doc: '/ tutorial / aggregation-zip-code-data-set'
>
> Thanks,
> Judah
>
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support


-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
