[jira] [Commented] (OPENNLP-1099) Is this a typical tokenization issue?

Suneel Marthi (JIRA) Thu, 29 Jun 2017 14:53:09 -0700

    [ https://issues.apache.org/jira/browse/OPENNLP-1099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16069045#comment-16069045 ]


Suneel Marthi commented on OPENNLP-1099:
----------------------------------------

DL4J uses Apache UIMA. They could use either Lucene or OpenNLP for their 
tokenization.

> Is this a typical tokenization issue?
> -------------------------------------
>
>                 Key: OPENNLP-1099
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1099
>             Project: OpenNLP
>          Issue Type: Bug
>          Components: Lemmatizer
>            Reporter: martin
>             Fix For: 1.8.1
>
>
> I am testing OpenNLP and found a significant tokenization issue involving 
> punctuation:
> Thank you Costco!
> i love costco!
> I love Costco!!
> FUCK IKEA.
> In all these cases, the trailing punctuation is not split off, so "Costco!" 
> and "IKEA." are each treated as one token. This looks like a systematic 
> problem.
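The symptom in the report is what whitespace-only splitting would produce. As a rough illustration (not OpenNLP code, and not a claim about what the reporter's pipeline actually runs), the sketch below contrasts whitespace splitting with a pass that peels trailing punctuation off each token. The class name, method names, and the punctuation set are all hypothetical and chosen only for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: plain Java, no OpenNLP dependency.
public class PunctuationSplitSketch {

    // Splits on whitespace only, so "Costco!" stays a single token.
    static List<String> whitespaceTokens(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
    }

    // Assumed punctuation set for the example; a real tokenizer covers far more.
    static boolean isPunct(char c) {
        return ".!?,;:".indexOf(c) >= 0;
    }

    // Emits the word part of each whitespace token, then each trailing
    // punctuation character as its own token.
    static List<String> splitTrailingPunctuation(String text) {
        List<String> out = new ArrayList<>();
        for (String tok : whitespaceTokens(text)) {
            int end = tok.length();
            while (end > 0 && isPunct(tok.charAt(end - 1))) {
                end--;
            }
            if (end > 0) {
                out.add(tok.substring(0, end));
            }
            for (int i = end; i < tok.length(); i++) {
                out.add(String.valueOf(tok.charAt(i)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"Thank you Costco!", "I love Costco!!"}) {
            System.out.println(whitespaceTokens(s) + " -> " + splitTrailingPunctuation(s));
        }
    }
}
```

If the tokens in the report come from a whitespace-based tokenizer, switching to a tokenizer that separates punctuation (or a trained tokenizer model) would be the thing to try first; which tokenizer is actually in use is not stated in the issue.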



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
