martin created OPENNLP-1099:
-------------------------------

             Summary: Is this a typical tokenization issue?
                 Key: OPENNLP-1099
                 URL: https://issues.apache.org/jira/browse/OPENNLP-1099
             Project: OpenNLP
          Issue Type: Bug
          Components: Lemmatizer
            Reporter: martin


I am testing OpenNLP and found a significant tokenization issue involving 
trailing punctuation. Example inputs:

Thank you Costco!
i love costco!
I love Costco!!
FUCK IKEA.

In all these cases, the trailing punctuation is not split off, so "Costco!" and 
"IKEA." are each treated as a single token. This looks like a systematic problem. 
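
To make the difference concrete, here is a minimal sketch in plain Java. It 
does not call OpenNLP at all; the class TokenSketch and both methods are 
hypothetical names used only for illustration. The first method splits on 
whitespace only, which is one way to get the output described above. The second 
shows the tokens one would expect, assuming each punctuation mark should become 
its own token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only: plain Java, no OpenNLP classes involved.
class TokenSketch {

    // A run of letters/digits, or a single non-space punctuation/symbol character.
    private static final Pattern TOKEN =
            Pattern.compile("[\\p{L}\\p{N}]+|[^\\p{L}\\p{N}\\s]");

    // Splits on whitespace only, so "Costco!" stays a single token.
    static String[] whitespaceTokens(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    // Assumed expected behaviour: every punctuation mark is a separate token.
    static String[] expectedTokens(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"Thank you Costco!", "I love Costco!!"}) {
            System.out.println("whitespace only: " + Arrays.toString(whitespaceTokens(s)));
            System.out.println("expected:        " + Arrays.toString(expectedTokens(s)));
        }
    }
}
```

Whether "!!" should end up as one token or two is an assumption of this sketch, 
not something the report above settles.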




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
