Re: [DISCUSS] FW: [jira] [Created] (CTAKES-145) inconsistent handling of upper ascii

Steven Bethard Tue, 05 Feb 2013 12:11:36 -0800

On Feb 5, 2013, at 1:03 PM, "Masanz, James J." <[email protected]> wrote:
> I propose having cTAKES, by default, accept UTF8 - not just (basic) ASCII


Yes please. Anything that replaces characters instead of using the correct 
encoding is just a bug waiting to happen.
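
To make the difference concrete, here is a minimal sketch. It is in Python only for brevity (cTAKES itself is Java), and the sample string and both function names are made up for illustration; nothing here is cTAKES code.

```python
# Hypothetical sample: clinical-style text with two non-ASCII characters.
SAMPLE_BYTES = "café 5µg".encode("utf-8")


def decode_lossy(data: bytes) -> str:
    """Treat the input as ASCII and substitute anything else.

    Every byte outside the ASCII range becomes U+FFFD, so the original
    characters cannot be recovered afterwards.
    """
    return data.decode("ascii", errors="replace")


def decode_utf8(data: bytes) -> str:
    """Decode with the encoding the text was actually written in."""
    return data.decode("utf-8")


if __name__ == "__main__":
    print(repr(decode_lossy(SAMPLE_BYTES)))
    print(repr(decode_utf8(SAMPLE_BYTES)))
```

The lossy version also changes the string length, since each multi-byte character turns into several replacement characters, so any character offsets computed downstream would no longer line up with the source document.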

> One consideration is that none of the training data used for the sentence 
> detector, part of speech tagger or chunker included such characters.

Might be worth running the current models over such text just to make sure 
things don't break horribly. I wouldn't expect them to, but you never know…

Steve
