On Feb 5, 2013, at 1:03 PM, "Masanz, James J." <[email protected]> wrote:

> I propose having cTAKES, by default, accept UTF8 - not just (basic) ASCII
Yes please. Anything that replaces characters instead of using the correct encoding is just a bug waiting to happen later.

> One consideration is that none of the training data used for the sentence
> detector, part of speech tagger or chunker included such characters.

Might be worth running the current models over such text just to make sure things don't break horribly. I wouldn't expect them to, but you never know…

Steve
