Martin Wiesner created OPENNLP-1512:
---------------------------------------
Summary: Fix incorrect encoding used in Conll02NameSampleStream
Key: OPENNLP-1512
URL: https://issues.apache.org/jira/browse/OPENNLP-1512
Project: OpenNLP
Issue Type: Improvement
Components: Formats, Name Finder
Affects Versions: 2.3.0
Reporter: Martin Wiesner
Assignee: Martin Wiesner
Fix For: 2.3.1
While working on OPENNLP-1190, I tested the example from the OpenNLP
documentation to convert the Esp.train example to the OpenNLP format, see:
[https://opennlp.apache.org/docs/2.3.0/manual/opennlp.html#tools.corpora.conll.2002]
I ran:
opennlp TokenNameFinderConverter conll02 -data esp.train -lang es -types per > es_corpus_train_persons.txt
When I checked the output corpus (txt) file, I noticed that it contained
garbled characters.
A quick debugging session revealed that the original files were ISO_8859_1
encoded. However, line 94 of Conll02NameSampleStream assumes UTF-8 encoding.
As a result, accented and other special characters of the Spanish alphabet
are turned into garbage in the resulting UTF-8 encoded file, because the
input bytes are decoded with a different character set than the one they
were written in.
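The effect can be reproduced outside of OpenNLP with plain JDK classes. The
following is only a minimal sketch of the mismatch, not the code from
Conll02NameSampleStream; the class name EncodingMismatchDemo, its helper
methods and the sample word are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of OpenNLP.
public class EncodingMismatchDemo {

    /** Encodes text as ISO-8859-1, then decodes those bytes as UTF-8 (the mismatch). */
    static String decodeWrongly(String text) {
        byte[] latin1Bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1Bytes, StandardCharsets.UTF_8);
    }

    /** Encodes and decodes with the same charset, ISO-8859-1. */
    static String decodeCorrectly(String text) {
        byte[] latin1Bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1Bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "España", written with a Unicode escape so the source file encoding does not matter.
        String word = "Espa\u00f1a";

        // The single byte 0xF1 is not a valid UTF-8 sequence, so the text is damaged.
        System.out.println(decodeWrongly(word).equals(word));   // prints "false"

        // Decoding with the charset the bytes were written in restores the text.
        System.out.println(decodeCorrectly(word).equals(word)); // prints "true"
    }
}
```

Only non-ASCII characters are affected, which would explain why most of the
converted corpus looks fine and just the accented letters are broken.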
Therefore, _Conll02NameSampleStream_ needs to be fixed so that it reads the
original files as ISO_8859_1.
With this fix in place, accented characters such as á and é are written
correctly to the resulting converted training corpus file.
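The general idea behind the fix, reading with the charset of the source file
and writing with the charset of the target file, could look roughly like the
sketch below. It uses only JDK classes and is not the actual OpenNLP patch;
the class name Latin1ToUtf8 and the method transcode are assumptions chosen
for this example.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of OpenNLP.
public class Latin1ToUtf8 {

    /** Copies a text file line by line, reading ISO-8859-1 and writing UTF-8. */
    static void transcode(Path in, Path out) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.ISO_8859_1);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(line);
                writer.newLine(); // platform line separator
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java Latin1ToUtf8 <input file> <output file>
        transcode(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

Both charsets are stated explicitly here, so the result does not depend on
the platform default encoding.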