[ 
https://issues.apache.org/jira/browse/OPENNLP-1353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478693#comment-17478693
 ] 

ASF GitHub Bot commented on OPENNLP-1353:
-----------------------------------------

jzonthemtn opened a new pull request #403:
URL: https://github.com/apache/opennlp/pull/403


   Just fixing a few checkstyle issues introduced in the PR for OPENNLP-1353.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.



> DictonaryLemmatizer missing charset
> -----------------------------------
>
>                 Key: OPENNLP-1353
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1353
>             Project: OpenNLP
>          Issue Type: Bug
>          Components: Lemmatizer
>    Affects Versions: 1.9.3
>         Environment: Windows 10
>            Reporter: Robert
>            Priority: Major
>             Fix For: 1.9.5
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The initialization of the DictionaryLemmatizer does not decode the input 
> stream correctly because no charset is specified.
> My dictionary file for the lemmatizer is UTF-8 encoded. During 
> DictionaryLemmatizer initialization, the InputStreamReader is created without 
> a charset, so the platform default encoding is used, in my case 
> windows-1252. As a result, the correct lemmas of words with non-ASCII 
> characters are not found.
> For example, my {{lemma.dict}} file contains the following line (UTF-8):
> {code:java}
> mäuse      NN     maus   // German word for mice
> {code}
> And the InputStreamReader decodes it as windows-1252:
> {code:java}
> mÃ¤use    NN    maus
> {code}
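
The decoding problem described above can be reproduced outside OpenNLP. The following is a minimal sketch, not the actual fix from the PR: `CharsetDemo` and `firstLine` are hypothetical names used only for illustration, and the dictionary line is held in memory instead of being read from a `lemma.dict` file. It decodes the same UTF-8 bytes once as windows-1252 and once as UTF-8.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of OpenNLP.
class CharsetDemo {

    // Reads the first line of the stream, decoding its bytes with the given charset.
    static String firstLine(InputStream in, Charset charset) throws IOException {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            return reader.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        // The dictionary line from the report as UTF-8 bytes; \u00e4 is the letter a-umlaut.
        byte[] utf8 = "m\u00e4use\tNN\tmaus".getBytes(StandardCharsets.UTF_8);

        // Decoding with the wrong charset turns the two UTF-8 bytes of the
        // umlaut into two separate characters (U+00C3 and U+00A4).
        String wrong = firstLine(new ByteArrayInputStream(utf8), Charset.forName("windows-1252"));

        // Passing the charset explicitly restores the original word.
        String right = firstLine(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8);

        System.out.println("windows-1252: " + wrong);
        System.out.println("UTF-8:        " + right);
    }
}
```

The point of the sketch is that an `InputStreamReader` created without a charset argument behaves like the first call on a machine whose default encoding is windows-1252, so a lookup for the correctly spelled word finds no entry.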



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
