On Feb 5, 2013, at 1:03 PM, "Masanz, James J." <[email protected]> wrote:
> I propose having cTAKES, by default, accept UTF8 - not just (basic) ASCII

Yes please. Anything that replaces characters instead of using the correct
encoding is just a bug waiting to happen later.
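
To illustrate the kind of bug I mean, here is a minimal sketch in Python
(not cTAKES code; the sample note text and variable names are made up for
illustration). It decodes the same UTF-8 bytes twice, once as ASCII with
replacement and once with the correct encoding:

```python
# Hypothetical note text encoded as UTF-8 (illustrative, not from cTAKES).
raw = "Temp 38.5 °C, pt. José".encode("utf-8")

# Decoding as ASCII with replacement: each non-ASCII byte becomes U+FFFD,
# so the original characters are lost and the string length changes.
lossy = raw.decode("ascii", errors="replace")

# Decoding with the correct encoding: the text round-trips unchanged.
correct = raw.decode("utf-8")

print(lossy)
print(correct)
print(len(lossy), len(correct))
```

The replaced version not only loses the characters, it also has a different
length, so any character offsets computed against it would no longer line up
with the original document.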

> One consideration is that none of the training data used for the sentence 
> detector, part of speech tagger or chunker included such characters.

Might be worth running the current models over such text just to make sure 
things don't break horribly. I wouldn't expect them to, but you never know…

Steve
