[jira] [Created] (OPENNLP-1353) DictonaryLemmatizer missing charset

Robert (Jira) Mon, 17 Jan 2022 07:36:13 -0800

Robert created OPENNLP-1353:
-------------------------------

             Summary: DictonaryLemmatizer missing charset
                 Key: OPENNLP-1353
                 URL: https://issues.apache.org/jira/browse/OPENNLP-1353
             Project: OpenNLP
          Issue Type: Bug
          Components: Lemmatizer
    Affects Versions: 1.9.3
         Environment: Windows 10
            Reporter: Robert



The DictionaryLemmatizer does not decode its dictionary InputStream correctly 
during initialization, because no charset is specified.

My dictionary file for the lemmatizer is UTF-8 encoded. Because no charset is 
specified for the InputStream, DictionaryLemmatizer initialization falls back 
to the system default encoding, which in my case is windows-1252. As a result, 
entries containing non-ASCII characters are stored in garbled form and the 
correct lemmas for those words are not found.
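
As a minimal sketch of the kind of fix this implies, the reader over the dictionary stream can be given an explicit charset instead of relying on the platform default. This uses only JDK classes; the class name {{ExplicitCharsetRead}} and the helper {{readLines}} are hypothetical and are not part of OpenNLP, and the sketch assumes the dictionary is always UTF-8.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class ExplicitCharsetRead {

    // Hypothetical helper (not an OpenNLP API): reads all lines from the
    // stream, always decoding the bytes as UTF-8 rather than using the
    // platform default charset.
    static List<String> readLines(InputStream in) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a UTF-8 encoded dictionary file ("\u00e4" is "ä").
        byte[] utf8 = "m\u00e4use\tNN\tmaus\n".getBytes(StandardCharsets.UTF_8);
        List<String> lines = readLines(new ByteArrayInputStream(utf8));
        // The umlaut survives regardless of the system default encoding.
        System.out.println(lines.get(0).startsWith("m\u00e4use"));
    }
}
```

The check prints a boolean rather than the word itself, since console output on Windows may use yet another encoding.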

For example, my {{lemma.dict}} file contains the following line:
mäuse      NN     maus
With windows-1252 as the fallback encoding, it is decoded as:
mÃ¤use    NN    maus
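
The garbling above can be reproduced with the JDK alone, which may help when writing a regression test. The sketch below encodes the word as UTF-8 and decodes the same bytes once as windows-1252 and once as UTF-8; the class name {{MojibakeDemo}} and the helper {{decodeAs}} are hypothetical, and the sketch assumes the windows-1252 charset is available in the running JVM.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {

    // Hypothetical helper: decodes raw bytes with the given charset.
    static String decodeAs(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "m\u00e4use" is "mäuse"; in UTF-8 the "ä" is the two bytes 0xC3 0xA4.
        byte[] utf8 = "m\u00e4use".getBytes(StandardCharsets.UTF_8);

        // Decoding those bytes as windows-1252 turns each byte into its own
        // character: 0xC3 becomes "Ã" (U+00C3) and 0xA4 becomes "¤" (U+00A4).
        String wrong = decodeAs(utf8, Charset.forName("windows-1252"));
        String right = decodeAs(utf8, StandardCharsets.UTF_8);

        System.out.println(wrong.equals("m\u00c3\u00a4use")); // true
        System.out.println(right.equals("m\u00e4use"));       // true
    }
}
```

A dictionary keyed on the garbled string can never match a lookup for the correctly decoded word, which is consistent with the missing lemmas described in this report.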



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
