Pei Chen created CTAKES-44:
------------------------------

             Summary: Tokenizer does not handle Windows-style newlines
                 Key: CTAKES-44
                 URL: https://issues.apache.org/jira/browse/CTAKES-44
             Project: cTAKES
          Issue Type: Improvement
            Reporter: Pei Chen
            Priority: Minor


Moved from SF Tracker: 3307765 
 
The Tokenizer does not handle Windows-style newlines correctly. If the text 
contains \r\n newlines, three Tokens are returned for each "\r\n" instance, 
for the following reasons.

- getEndOfLineTokens creates a NewlineToken for both the \r and the \n. That is 
two NewlineTokens.
- getRawTokens creates a SymbolToken for each \r\n that stands alone on its own 
line (i.e., a blank line).
- the final for loop in tokenize creates a SymbolToken for each newline that 
ends a line with something else on it.

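Since each newline either shares its line with other text or does not, exactly 
one of the last two cases applies. That gives three tokens (two NewlineTokens 
and one SymbolToken) for each \r\n.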
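
As a rough illustration of the behavior one would want instead, the sketch 
below scans a string and reports one span per line break, treating "\r\n" as a 
single break. The class NewlineScanner, the Span type, and findLineBreaks are 
hypothetical names made up for this example; this is not the cTAKES Tokenizer 
code, only a minimal sketch of the idea.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of cTAKES.
public class NewlineScanner {

    /** Half-open [begin, end) character offsets of one line break. */
    public static final class Span {
        public final int begin;
        public final int end;

        Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + begin + "," + end + ")";
        }
    }

    /** Returns one span per line break; "\r\n" counts as a single break. */
    public static List<Span> findLineBreaks(String text) {
        List<Span> breaks = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (c == '\r' && i + 1 < text.length() && text.charAt(i + 1) == '\n') {
                // Windows-style newline: consume both characters as one break.
                breaks.add(new Span(i, i + 2));
                i += 2;
            } else if (c == '\n' || c == '\r') {
                // Lone "\n" (Unix) or lone "\r" (old Mac) newline.
                breaks.add(new Span(i, i + 1));
                i += 1;
            } else {
                i += 1;
            }
        }
        return breaks;
    }

    public static void main(String[] args) {
        // Two "\r\n" breaks (one ends a line of text, one is a blank line)
        // followed by a Unix "\n" break.
        System.out.println(findLineBreaks("no pain\r\n\r\nfever\n"));
    }
}
```

With a scan like this, each "\r\n" yields one break span regardless of whether 
it shares its line with other text.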

On a related note, getRawTokens returns the last raw token as "word\r" when the 
\r\n newline shares its line with some preceding text.

This was noticed because many of our clinical notes have newlines 
mid-sentence, so we do not split sentences on newlines. We then found that the 
negation and status annotators sometimes failed when a newline, rather than a 
space, separated the negation word from the named entity: the extra tokens put 
the negation word too far from the entity in terms of intervening tokens.
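
To make the effect concrete, here is a small sketch of a token-distance check. 
Everything in it is an assumption for illustration: the class 
NegationWindowDemo, the limit MAX_INTERVENING_TOKENS, and the placeholder 
token strings are invented, and the real annotators' window size and token 
types are not stated in this report.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not the cTAKES negation annotator.
public class NegationWindowDemo {

    /** Assumed limit on tokens allowed between the cue and the entity. */
    public static final int MAX_INTERVENING_TOKENS = 2;

    /** Counts the tokens strictly between the first cue and first entity. */
    public static int interveningTokens(List<String> tokens, String cue, String entity) {
        int cueIdx = tokens.indexOf(cue);
        int entityIdx = tokens.indexOf(entity);
        if (cueIdx < 0 || entityIdx < 0) {
            throw new IllegalArgumentException("cue or entity not in token list");
        }
        return Math.abs(entityIdx - cueIdx) - 1;
    }

    /** True if the cue is close enough to the entity under the assumed limit. */
    public static boolean inWindow(List<String> tokens, String cue, String entity) {
        return interveningTokens(tokens, cue, entity) <= MAX_INTERVENING_TOKENS;
    }

    public static void main(String[] args) {
        // "denies fever" separated by a space: no intervening tokens.
        List<String> spaceSeparated = Arrays.asList("denies", "fever");

        // Same words separated by "\r\n", with the three extra tokens
        // described in this issue (placeholder names for two NewlineTokens
        // and one SymbolToken).
        List<String> crlfSeparated =
                Arrays.asList("denies", "<NL>", "<NL>", "<SYM>", "fever");

        System.out.println(inWindow(spaceSeparated, "denies", "fever"));
        System.out.println(inWindow(crlfSeparated, "denies", "fever"));
    }
}
```

Under these assumed values, the space-separated pair falls inside the window 
and the \r\n-separated pair does not, which matches the symptom described 
above.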

If 


--------------------------------------------------------------------------------



        
