Pei Chen created CTAKES-44:
------------------------------
Summary: Tokenizer does not handle Windows-style newlines
Key: CTAKES-44
URL: https://issues.apache.org/jira/browse/CTAKES-44
Project: cTAKES
Issue Type: Improvement
Reporter: Pei Chen
Priority: Minor
Moved from SF Tracker: 3307765
The Tokenizer is not Windows-newline friendly. If the text contains \r\n
newlines, they are tokenized incorrectly: three Tokens get returned for each
"\r\n" instance.
- getEndOfLineTokens creates a NewlineToken for both the \r and the \n. That is
two NewlineTokens.
- getRawTokens creates a SymbolToken for each \r\n that stands alone on its own
line (i.e., a blank line).
- the final for loop in tokenize creates a SymbolToken for each newline that
ends a line with something else on it.
Since each newline either shares a line or does not, this is three tokens for
each \r\n.
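A minimal sketch of one way to count a "\r\n" pair as a single end-of-line
token. This is not the cTAKES code: the function name newline_spans is
hypothetical, and the only idea shown is matching the two-character sequence
before either character alone.

```python
import re

# Try "\r\n" first so the pair is consumed as one unit;
# a lone "\r" or a lone "\n" still matches on its own.
NEWLINE = re.compile(r"\r\n|\r|\n")

def newline_spans(text):
    """Return one (start, end) offset pair per logical newline in text."""
    return [m.span() for m in NEWLINE.finditer(text)]
```

With this pattern, a Windows newline produces one span of length two instead
of two spans of length one.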
On a related note, getRawTokens returns the last raw token as "word\r" when the
\r\n newline shares its line with some preceding text.
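The trailing "\r" could be avoided by treating "\r" as whitespace when
splitting raw tokens. The sketch below assumes a simple whitespace split;
raw_tokens is a hypothetical stand-in and does not reproduce how getRawTokens
actually works.

```python
import re

def raw_tokens(text):
    # \S+ matches runs of non-whitespace. Both "\r" and "\n" count as
    # whitespace here, so neither can end up attached to a word.
    return [(m.start(), m.end(), m.group()) for m in re.finditer(r"\S+", text)]
```

Keeping the start and end offsets alongside the token text means the tokens
still map back to the original document, even though the newline characters
are dropped from the token text.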
This was noticed because many of our clinical notes have newlines mid-sentence,
so we do not split sentences on newlines. We then found that the negation and
status annotators sometimes failed when a newline, rather than a space,
separated the negation word from the named entity: the extra tokens put the
negation word too far from the entity, measured in intervening tokens.
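To illustrate the effect on a token-distance check, here is a small sketch.
The token labels, the window size of 2, and both function names are
assumptions made for this example; the real annotators may measure distance
differently.

```python
def intervening_tokens(tokens, left, right):
    """Count tokens strictly between `left` and the next `right` after it."""
    i = tokens.index(left)
    j = tokens.index(right, i + 1)
    return j - i - 1

def within_window(tokens, left, right, max_gap=2):
    # max_gap is a hypothetical limit, not a value taken from cTAKES.
    return intervening_tokens(tokens, left, right) <= max_gap

# Three tokens per "\r\n", as described in the report (order is illustrative).
reported = ["no", "NEWLINE", "NEWLINE", "SYMBOL", "fever"]
# One token per "\r\n".
expected = ["no", "NEWLINE", "fever"]
```

Under these assumptions, "no" and "fever" are three tokens apart in the
reported tokenization and fall outside the window, while a single newline
token keeps them within it.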
If