2010/3/17 maria sol ferrer <[email protected]>:
> Hi, i dont know if it is the default setting and there is an option to
> change that, but the tokenizer script is ignoring hyphens (-) and I would
> like it to separate them as different tokens, so that for example:
> "high-energy" is tokenized as "high - energy" and not just one token as it
> is doing now...

Quick answer: instead of

  tokenizer.perl <file>

use

  sed "s%-% - %g" <file> | tokenizer.perl

Then when you detokenize you might want to use

  sed "s% - %-%g" <file> | detokenizer.perl

(If you do this, however, every hyphen surrounded by spaces in your file
will be "detokenized", not only the ones the first sed command split. So if
you had sentences containing things like "John - or maybe it was Jack", or
"51 - 42 = 9", they will be changed to "John-or maybe it was Jack" and
"51-42 = 9".)

-- Raphael Payen
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
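
To try the two sed expressions from the reply on their own, without running the tokenizer, something like the sketch below may help. The function names split_hyphens and join_hyphens are hypothetical wrappers introduced here for illustration. Only the sed expressions themselves come from the message. The sample sentences are the ones mentioned in the thread.

```shell
#!/bin/sh
# Hypothetical helper names; only the sed expressions are from the reply.

# Put a space on each side of every hyphen, so that a whitespace-based
# tokenizer would see the hyphen as a separate token.
split_hyphens() {
    sed 's%-% - %g'
}

# Reverse step: turn every " - " back into "-". It cannot tell a hyphen
# that was split earlier from one that was already spaced in the input.
join_hyphens() {
    sed 's% - %-%g'
}

# Splitting a hyphenated compound.
printf '%s\n' 'high-energy' | split_hyphens

# Joining it again.
printf '%s\n' 'high - energy' | join_hyphens

# The caveat from the reply: hyphens that were spaced to begin with
# are joined as well.
printf '%s\n' 'John - or maybe it was Jack' | join_hyphens
printf '%s\n' '51 - 42 = 9' | join_hyphens
```

If the spaced hyphens in the original text need to survive, one possible workaround (an assumption, not something proposed in the thread) would be to split with a marker that cannot occur in the data instead of a bare hyphen, and to join on that marker only.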
