Re: [Moses-support] Tokenizer ignores hyphen

Raphael Payen Wed, 17 Mar 2010 11:02:59 -0700

2010/3/17 maria sol ferrer <[email protected]>:
> Hi, i dont know if it is the default setting and there is an option to
> change that, but the tokenizer script is ignoring hyphens (-) and I would
> like it to separate them as different tokens, so that for example:
> "high-energy" is tokenized as "high - energy" and not just one token as it
> is doing now...


Quick answer: instead of
tokenizer.perl <file>
use
sed "s%-% - %g" <file> | tokenizer.perl

Then when you detokenize you might want to use
sed "s% - %-%g" <file> | detokenizer.perl

(If you do this, however, all the hyphens in your file will be
"detokenized", so if you had sentences containing things like
"John - or maybe it was Jack", or "51 - 42 = 9", they will be
changed to "John-or maybe it was Jack" and "51-42 = 9".)
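
To check the effect on a few sample lines before running a whole corpus
through it, something like the sketch below could help. The function names
split_hyphens and join_hyphens are made up for illustration; only the two
sed expressions come from the commands above, and the tokenizer itself is
left out so the snippet runs on its own.

```shell
# Hypothetical helper names; the sed expressions are the ones shown above.
split_hyphens() { sed 's%-% - %g'; }   # would run before tokenizer.perl
join_hyphens()  { sed 's% - %-%g'; }   # would run before detokenizer.perl

# A hyphenated compound is split into three tokens.
printf '%s\n' 'high-energy' | split_hyphens

# In already-tokenized text, free-standing hyphens get joined as well.
printf '%s\n' 'John - or maybe it was Jack' '51 - 42 = 9' | join_hyphens
```

The first command prints "high - energy"; the second prints the two
over-joined lines described above.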

-- 
Raphael Payen
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
