[jira] [Updated] (OPENNLP-1202) Word tokenization

Bharani Sruthi (Jira) Sat, 22 Aug 2020 20:46:42 -0700


     [ 
https://issues.apache.org/jira/browse/OPENNLP-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Bharani Sruthi updated OPENNLP-1202:
------------------------------------
    Attachment: OpenNLPSampleProgramOutput.png

> Word tokenization 
> ------------------
>
>                 Key: OPENNLP-1202
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1202
>             Project: OpenNLP
>          Issue Type: Bug
>          Components: language model
>         Environment: Windows Server 2016, R version 3.3.3
>            Reporter: Dippy Aggarwal
>            Priority: Major
>              Labels: Annotations
>         Attachments: OpenNLPSampleProgramOutput.png, openNLP-output.png, 
> openNLPTest.r
>
>
> Came across an issue with identifying words in a sentence. For words such as 
> *can't*, tokenization using openNLP yields two tokens: "ca" and "n't".
> As an example (captured in the screenshot), see the tokenization for the 
> string
> *When heard the Xenogears soundtrack, so can't really describe.*
> Note the tokens with IDs 9 and 10 in the openNLP-output.png file. 
> Not sure if I am missing any parameters that would produce the correct 
> result. 
> Would appreciate any ideas or attention from the community on this issue. Thanks. 
>  
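
The "ca" / "n't" split described above matches the Penn Treebank tokenization
convention, which the pretrained English tokenizer models appear to follow. If
whole-word contractions are needed downstream, one possible workaround is to
re-attach clitic tokens after tokenization. The sketch below is a hypothetical
post-processing step, not part of openNLP or its R wrapper; the function name
`merge_clitics`, the `CLITICS` set, and the full token list are assumptions
made for illustration (only the "ca" and "n't" tokens come from the report).

```python
# Hypothetical post-processing helper; not an openNLP API.
# Clitics that Penn Treebank-style tokenizers typically split off (assumed list).
CLITICS = {"n't", "'s", "'re", "'ve", "'ll", "'d", "'m"}


def merge_clitics(tokens):
    """Re-attach clitic tokens (e.g. "n't") to the token that precedes them."""
    merged = []
    for tok in tokens:
        if merged and tok.lower() in CLITICS:
            # Glue the clitic back onto the previous token: "ca" + "n't" -> "can't".
            merged[-1] += tok
        else:
            merged.append(tok)
    return merged


# Assumed tokenizer output for the sentence in the report.
tokens = ["When", "heard", "the", "Xenogears", "soundtrack", ",",
          "so", "ca", "n't", "really", "describe", "."]
print(merge_clitics(tokens))
```

Note that this also merges possessives such as "soundtrack" + "'s", which may
or may not be wanted; trimming the `CLITICS` set to `{"n't"}` limits the merge
to negated contractions only.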



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
