Robert created OPENNLP-1353:
-------------------------------

             Summary: DictionaryLemmatizer missing charset
                 Key: OPENNLP-1353
                 URL: https://issues.apache.org/jira/browse/OPENNLP-1353
             Project: OpenNLP
          Issue Type: Bug
          Components: Lemmatizer
    Affects Versions: 1.9.3
         Environment: Windows 10
            Reporter: Robert


The DictionaryLemmatizer does not decode its dictionary InputStream correctly 
because no charset is specified during initialization.

My dictionary file for the lemmatizer is UTF-8 encoded. Because no charset is 
specified for the InputStream, DictionaryLemmatizer initialization falls back 
to the system default encoding, which in my case is windows-1252. As a result, 
the correct lemmas for words containing non-ASCII characters are not found.

E.g., my {{lemma.dict}} file contains the following line:
mäuse      NN     maus
which will be decoded to:
mÃ¤use    NN    maus
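
The effect described above can be reproduced with the JDK alone. The following 
is a minimal sketch, not the OpenNLP code: the class name CharsetDemo and the 
helper readLines are hypothetical, and it assumes the JDK in use ships the 
windows-1252 charset (mainstream JDKs do). It decodes the same UTF-8 bytes once 
with an explicit UTF-8 charset and once with windows-1252, which is what 
happens implicitly when the platform default is used on such a system.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class CharsetDemo {

    // Hypothetical helper: reads all lines from the stream, decoding the
    // bytes with the charset passed in rather than the platform default.
    public static List<String> readLines(InputStream in, Charset charset) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // One dictionary line, encoded as UTF-8. Unicode escapes are used so
        // this source file does not itself depend on the compiler's encoding.
        byte[] utf8 = "m\u00e4use\tNN\tmaus\n".getBytes(StandardCharsets.UTF_8);

        // Explicit UTF-8: the umlaut survives.
        List<String> right = readLines(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8);

        // windows-1252: each of the two UTF-8 bytes of the umlaut becomes its
        // own character, so the word no longer matches the intended entry.
        List<String> wrong = readLines(new ByteArrayInputStream(utf8), Charset.forName("windows-1252"));

        System.out.println(right.get(0).startsWith("m\u00e4use"));       // true
        System.out.println(wrong.get(0).startsWith("m\u00c3\u00a4use")); // true
    }
}
```

Until the lemmatizer accepts a charset itself, passing the charset explicitly 
wherever the dictionary bytes are turned into text is the usual way to make 
the result independent of the machine's default encoding.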



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
