James Joseph Masanz created CTAKES-227:
------------------------------------------
Summary: Broca's -> PunctuationToken instead of ContractionToken -
caused by apostrophe seen as sentence ending
Key: CTAKES-227
URL: https://issues.apache.org/jira/browse/CTAKES-227
Project: cTAKES
Issue Type: Bug
Components: ctakes-core
Affects Versions: 3.1
Reporter: James Joseph Masanz
Assignee: James Joseph Masanz
The recently rebuilt sentence detector (currently in trunk and the 3.1.0
branch) is sometimes taking the apostrophe as a sentence break where the
ctakes-3.0.0-incubating model didn’t.
The training data used for the recently rebuilt model contains only 7
lines that end with an apostrophe (single quote) followed immediately by a
newline.
It has >100K occurrences of 's.
It has >175K occurrences of the ' character in all.
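Counts like the ones above can be reproduced with a short script. The following is only a minimal sketch, not the script that produced the numbers in this issue: the function name apostrophe_stats and the sample text are made up for illustration, and it assumes the training data is plain text with Unix line endings and straight (ASCII) apostrophes.

```python
def apostrophe_stats(text):
    """Count apostrophe patterns in sentence-detector training text.

    Assumes Unix newlines and the ASCII apostrophe character (').
    """
    return {
        # Apostrophe followed immediately by a newline (line ends on ').
        "lines_ending_in_apostrophe": text.count("'\n"),
        # Every occurrence of the two-character sequence 's.
        "apostrophe_s": text.count("'s"),
        # Every apostrophe, whatever follows it.
        "all_apostrophes": text.count("'"),
    }


if __name__ == "__main__":
    # Hypothetical sample; real training data would be read from a file.
    sample = (
        "Lesion near Broca's area.\n"
        "Patient stated 'no pain'\n"
        "Follow-up in 2 weeks.\n"
    )
    print(apostrophe_stats(sample))
```

If lines ending in an apostrophe are that rare relative to 's, the model has seen very few examples either way for an apostrophe at a line end, which may be part of why it behaves inconsistently there.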
The place I noticed this is in testfakenote.txt.xml in ctakes-regression-test.
The word "Broca's" used to get a ContractionToken, but because a sentence now
ends on the apostrophe, the apostrophe is annotated as a PunctuationToken
instead.
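To illustrate why the sentence break changes the token type, here is a toy sketch. It is not the cTAKES tokenizer: the regular expression, the label() and tokenize() helpers, and the "WordToken" label are assumptions for illustration only. It assumes tokens are found inside each sentence span separately, so a contraction split across two sentences can no longer be recognized as a unit.

```python
import re

# Toy pattern: the contraction 's first, then words, then any single
# punctuation character. This is an illustration, not the cTAKES rules.
TOKEN = re.compile(r"'s\b|\w+|[^\w\s]")


def label(token):
    """Assign an illustrative token type to one token string."""
    if token == "'s":
        return "ContractionToken"
    if token[0].isalnum() or token[0] == "_":
        return "WordToken"
    return "PunctuationToken"


def tokenize(sentences):
    """Tokenize each sentence span on its own and label every token."""
    return [(tok, label(tok)) for s in sentences for tok in TOKEN.findall(s)]


if __name__ == "__main__":
    # One sentence: the apostrophe and the s stay together.
    print(tokenize(["Broca's area"]))
    # Sentence break right after the apostrophe: the ' is left on its own.
    print(tokenize(["Broca'", "s area"]))
```

In the first call the 's is labeled ContractionToken. In the second, the apostrophe ends the first span with nothing after it, so it falls through to PunctuationToken and the s becomes a separate word in the next span.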
See more in the thread started at
http://markmail.org/message/wavipejszlspzo5u
including examples that split correctly and incorrectly.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira