[
https://issues.apache.org/jira/browse/OPENNLP-1353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478583#comment-17478583
]
ASF GitHub Bot commented on OPENNLP-1353:
-----------------------------------------
rw026 commented on pull request #402:
URL: https://github.com/apache/opennlp/pull/402#issuecomment-1016348359
You're welcome!
> DictionaryLemmatizer missing charset
> ------------------------------------
>
> Key: OPENNLP-1353
> URL: https://issues.apache.org/jira/browse/OPENNLP-1353
> Project: OpenNLP
> Issue Type: Bug
> Components: Lemmatizer
> Affects Versions: 1.9.3
> Environment: Windows 10
> Reporter: Robert
> Priority: Major
> Original Estimate: 1h
> Remaining Estimate: 1h
>
> The DictionaryLemmatizer does not decode its input stream correctly during
> initialization because no charset is specified.
> My dictionary file for the lemmatizer is UTF-8 encoded. During
> DictionaryLemmatizer initialization the platform default encoding is used,
> because no charset is passed to the InputStreamReader. In my case that is
> windows-1252. As a result, the correct lemmas are not found for words
> containing non-ASCII characters.
> E.g. my {{lemma.dict}} file contains the following line (UTF-8):
> {code:java}
> mäuse NN maus // German word for mice
> {code}
> And the InputStreamReader decodes it as windows-1252:
> {code:java}
> mÃ¤use NN maus
> {code}
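
The mis-decoding described above can be reproduced without OpenNLP. The following is only a minimal sketch: the class name `CharsetDemo` and the helper `firstLine` are hypothetical, and the tab separator in the sample entry is an assumption about the dictionary layout. It decodes the same UTF-8 bytes once as windows-1252 and once as UTF-8, using the standard `InputStreamReader(InputStream, Charset)` constructor.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetDemo {

    // Hypothetical helper: reads the first line of the stream,
    // decoding the bytes with the charset passed in explicitly.
    public static String firstLine(InputStream in, Charset charset) throws IOException {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            return reader.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        // The entry from the report, encoded as UTF-8 bytes.
        // \u00e4 is 'ä'; the tab separator is an assumption.
        byte[] utf8 = "m\u00e4use\tNN\tmaus".getBytes(StandardCharsets.UTF_8);

        // Wrong charset: the two UTF-8 bytes of 'ä' (0xC3 0xA4) are decoded
        // as two separate windows-1252 characters, so the key no longer matches.
        String wrong = firstLine(new ByteArrayInputStream(utf8), Charset.forName("windows-1252"));

        // Charset that matches the file: the entry is read back unchanged.
        String right = firstLine(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8);

        System.out.println("windows-1252: " + wrong);
        System.out.println("UTF-8:        " + right);
    }
}
```

Passing the charset explicitly makes the result independent of the platform default, which is why the same dictionary file can behave differently on Windows 10 than on a system whose default is UTF-8.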