[jira] [Commented] (OPENNLP-1353) DictonaryLemmatizer missing charset

ASF GitHub Bot (Jira) Wed, 19 Jan 2022 03:13:08 -0800


    [ 
https://issues.apache.org/jira/browse/OPENNLP-1353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478583#comment-17478583
 ]


ASF GitHub Bot commented on OPENNLP-1353:
-----------------------------------------

rw026 commented on pull request #402:
URL: https://github.com/apache/opennlp/pull/402#issuecomment-1016348359


   You're welcome!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


> DictonaryLemmatizer missing charset
> -----------------------------------
>
>                 Key: OPENNLP-1353
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1353
>             Project: OpenNLP
>          Issue Type: Bug
>          Components: Lemmatizer
>    Affects Versions: 1.9.3
>         Environment: Windows 10
>            Reporter: Robert
>            Priority: Major
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The DictionaryLemmatizer does not decode its input stream correctly at 
> initialization because no charset is specified.
> My dictionary file for the lemmatizer is UTF-8 encoded. During 
> DictionaryLemmatizer initialization, the InputStreamReader is created without 
> a charset, so the platform default encoding is used, in my case 
> windows-1252. As a result, the correct lemmas of words containing non-ASCII 
> characters are not found.
> E.g. my {{lemma.dict}} file contains the following line (UTF-8):
> {code:java}
> mäuse      NN     maus   // German word for mice
> {code}
> And the InputStreamReader decodes it as windows-1252:
> {code:java}
> mÃ¤use    NN    maus
> {code}
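
The decoding mismatch described in the report can be reproduced with the JDK alone. The following is a minimal sketch, not OpenNLP code: the class name CharsetDemo and the helper readFirstLine are hypothetical and exist only for illustration. It assumes the dictionary line is stored as UTF-8 and compares decoding it with windows-1252 (the reporter's platform default) against decoding it with an explicit UTF-8 charset passed to the InputStreamReader.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it does not use or mirror any OpenNLP API.
public class CharsetDemo {

    // Reads the first line of the stream, decoding its bytes with the given charset.
    public static String readFirstLine(InputStream in, Charset charset) throws IOException {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            return reader.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        // The dictionary line from the report. The umlaut is written as a Unicode
        // escape so the demo does not depend on the encoding of this source file.
        byte[] utf8Bytes = "m\u00e4use\tNN\tmaus".getBytes(StandardCharsets.UTF_8);

        // Decoding with the wrong charset: each byte of the two-byte UTF-8
        // sequence for the umlaut becomes a separate windows-1252 character.
        String wrong = readFirstLine(new ByteArrayInputStream(utf8Bytes),
                Charset.forName("windows-1252"));

        // Decoding with the charset the file was actually written in.
        String right = readFirstLine(new ByteArrayInputStream(utf8Bytes),
                StandardCharsets.UTF_8);

        System.out.println("windows-1252 keeps the key intact: " + wrong.startsWith("m\u00e4use"));
        System.out.println("UTF-8 keeps the key intact: " + right.startsWith("m\u00e4use"));
    }
}
```

Passing the charset explicitly makes the result independent of the platform default, which is why the same dictionary file can behave differently on Windows 10 than on a system whose default encoding is UTF-8.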



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
