martin created OPENNLP-1099:
-------------------------------
Summary: Is this a typical tokenization issue?
Key: OPENNLP-1099
URL: https://issues.apache.org/jira/browse/OPENNLP-1099
Project: OpenNLP
Issue Type: Bug
Components: Lemmatizer
Reporter: martin
I am testing OpenNLP and found a significant tokenization issue involving
punctuation. Example inputs:
Thank you Costco!
i love costco!
I love Costco!!
FUCK IKEA.
In all of these cases, the trailing punctuation is not split off, so "Costco!" and
"IKEA." are each treated as a single token. This looks like a systematic problem.
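
To make the expected behavior concrete, here is a minimal standalone sketch in plain Java. It does not call OpenNLP and is not meant to represent how any OpenNLP tokenizer is implemented; the class name PunctuationSplitSketch and both method names are hypothetical, chosen only for illustration. The first method reproduces the symptom described above by splitting on whitespace only, and the second shows the output I would expect, with each punctuation mark as its own token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of OpenNLP.
public class PunctuationSplitSketch {

    // A token is either a run of letters/digits, or one single character
    // that is neither a letter, a digit, nor whitespace (e.g. '!' or '.').
    private static final Pattern TOKEN =
            Pattern.compile("[\\p{L}\\p{N}]+|[^\\p{L}\\p{N}\\s]");

    // Reproduces the reported symptom: splitting on whitespace only,
    // so trailing punctuation stays attached to the preceding word.
    static List<String> whitespaceTokens(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // Expected behavior: every punctuation mark becomes a separate token.
    static List<String> punctuationAwareTokens(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String[] samples = {"Thank you Costco!", "i love costco!", "I love Costco!!"};
        for (String sample : samples) {
            System.out.println("whitespace only:   " + whitespaceTokens(sample));
            System.out.println("punctuation split: " + punctuationAwareTokens(sample));
        }
    }
}
```

With this sketch, "Thank you Costco!" yields "Costco!" as one token under whitespace-only splitting, and "Costco" followed by "!" under the punctuation-aware split. The second form is what I expected to get from the tokenizer.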
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)