Hi,

Thank you very much! That's incredibly helpful. My one concern is that the decoder was crashing before I started tokenizing its input. Do you know which tokens would cause that behavior if left in? Would you recommend leaving only path names and URLs untokenized, and tokenizing everything else as before?
Judah

On Tue, Jul 15, 2014 at 4:02 AM, Barry Haddow <[email protected]> wrote:

> Hi Judah
>
> The actual problem here is that you do not want path names split by the
> tokeniser. It's only really set up to deal with regular text, but what you
> can do is ask it to "protect" certain patterns by using the
>
> -protected <filename>
>
> argument. The file <filename> should contain a list of regular expressions
> (one per line), and the tokeniser will not split apart any tokens which
> match these REs. I'm guessing that in the example below you don't want
> "tutorial" translated into the target language, and if the tokeniser
> doesn't split the path then the whole thing will pass through as an OOV,
>
> cheers - Barry
>
>
> On 14/07/14 16:53, Judah Schvimer wrote:
>
>> Hi,
>>
>> When I'm using the decoder I have to tokenize my target sentences before
>> I translate them. However, when I detokenize them it leaves awkward spaces
>> around what was tokenized. is there any way to fix this? It seems to be
>> mainly around slashes and colons
>>
>> Source: :doc:`/tutorial/aggregation-zip-code-data-set`
>> Target: : Doc: '/ tutorial / aggregation-zip-code-data-set'
>>
>> Thanks,
>> Judah
>>
>>
>> _______________________________________________
>> Moses-support mailing list
>> [email protected]
>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>
>
>
> --
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
>
>
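To make the "protect" idea in the quoted reply concrete, here is a minimal Python sketch of what protecting patterns amounts to. It is only an illustration and not the Moses tokeniser's own code: the three regular expressions are hypothetical examples of what the file passed to -protected might contain (one per line), and naive_tokenize is a stand-in for the real tokeniser.

```python
import re

# Hypothetical patterns, one regular expression per entry, in the style of the
# lines of a file passed via -protected. These are illustrative assumptions,
# not patterns shipped with Moses.
PROTECTED_PATTERNS = [
    r":[a-z]+:`[^`]*`",    # reST roles such as :doc:`/tutorial/...`
    r"https?://[^\s]+",    # URLs
    r"(?:/[\w.\-]+)+/?",   # Unix-style path names
]


def naive_tokenize(text):
    """Stand-in tokeniser: put spaces around every punctuation character."""
    return re.sub(r"([^\w\s])", r" \1 ", text).split()


def tokenize_with_protection(text, patterns=PROTECTED_PATTERNS):
    """Tokenise text, keeping any span that matches a pattern as one token."""
    combined = re.compile("|".join(f"(?:{p})" for p in patterns))
    tokens = []
    pos = 0
    for match in combined.finditer(text):
        # Tokenise the ordinary text before the protected span as usual.
        tokens.extend(naive_tokenize(text[pos:match.start()]))
        # Pass the protected span through untouched.
        tokens.append(match.group(0))
        pos = match.end()
    tokens.extend(naive_tokenize(text[pos:]))
    return tokens


if __name__ == "__main__":
    line = "See :doc:`/tutorial/aggregation-zip-code-data-set` now."
    print(" ".join(tokenize_with_protection(line)))
```

With the role kept as a single token, the decoder would see it as one unknown word and pass it through unchanged, so detokenizing should not leave stray spaces around the slashes and colons.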
