[
https://issues.apache.org/jira/browse/OPENNLP-367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13148167#comment-13148167
]
James Kosin commented on OPENNLP-367:
-------------------------------------
Joern,
That is one of the issues. Looking at the code, it appears someone has removed
the calls that read the encoding from the CLI. The CoNLL 2002 code and the
CoNLL 2003 code now have hard-coded encodings when opening the files. I also
think that specifying -Dfile.encoding=UTF-8, which I used to fix the issue you
had with the CoNLL 2002 data encoding, may have fixed the System.out encoding
issue as well. I just didn't realize it at the time.
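As a rough illustration of the System.out point: on the Java versions in use
here, System.out encodes text with the platform default charset, which is
presumably why -Dfile.encoding=UTF-8 changes its behaviour. A minimal sketch
of setting the encoding explicitly instead, assuming a hypothetical helper
class named ConsoleEncoding (not part of OpenNLP):

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

// Hypothetical helper, not OpenNLP code: wraps an output stream so that
// text is encoded with a named charset rather than the platform default.
final class ConsoleEncoding {
    private ConsoleEncoding() {}

    static PrintStream wrap(OutputStream target, String encodingName)
            throws UnsupportedEncodingException {
        // autoflush = true so println output shows up immediately
        return new PrintStream(target, true, encodingName);
    }
}
```

A caller would then write through ConsoleEncoding.wrap(System.out, "UTF-8")
instead of using System.out directly. Whether the terminal displays the
result correctly still depends on the terminal's own encoding.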
Anyway, I just want to put this issue to bed once and for all by encapsulating
the file opening, reading, and so on in a class and refactoring, so we don't
have to remember to do this, this, and that for every new addition.
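The kind of encapsulation described above could look roughly like the
following. This is only a sketch under stated assumptions: the class name
EncodedFiles and its methods are hypothetical, it uses java.nio.file (Java 7
or later), and it is not the refactoring that was eventually committed.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not OpenNLP code: the single place where files are
// opened, so the encoding is always passed in explicitly and never falls
// back to a hard-coded value or the platform default.
final class EncodedFiles {
    private EncodedFiles() {}

    static BufferedReader openReader(Path file, String encodingName)
            throws IOException {
        // Charset.forName throws for unknown or unsupported charset names
        Charset charset = Charset.forName(encodingName);
        return Files.newBufferedReader(file, charset);
    }

    static BufferedWriter openWriter(Path file, String encodingName)
            throws IOException {
        Charset charset = Charset.forName(encodingName);
        return Files.newBufferedWriter(file, charset);
    }
}
```

With something like this, each tool would pass the value of the -encoding
parameter straight through to openReader and openWriter, and no caller would
need to remember the individual steps.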
I was planning to first determine why everything isn't working, which may
just be a Windows thing, since Linux is leaning more these days toward UTF-8
as the encoding for the entire OS.
Also, I always convert from the original sources whenever possible when doing
my tests. For example, I have the files eng.train, eng.testa and eng.testb
for the English 2003 data, none of which have been converted. I've also added
the CoNLL 2002 data, which hasn't been converted either. This way I can test
most of the system for the NameFinder.
> File Encoding Issues
> --------------------
>
> Key: OPENNLP-367
> URL: https://issues.apache.org/jira/browse/OPENNLP-367
> Project: OpenNLP
> Issue Type: Bug
> Components: Command Line Interface
> Affects Versions: tools-1.5.2-incubating
> Environment: All
> Reporter: James Kosin
> Assignee: James Kosin
> Labels: encoding, rework, training
> Original Estimate: 672h
> Remaining Estimate: 672h
>
> The input and output encodings are not working correctly or are not properly
> handled. A good example is the CoNLL 2002 data, which, even when correctly
> encoded in UTF-8, does not work for training without specifying
> -Dfile.encoding=UTF-8 for the Java command.
> We already specify the input and expected output encoding on the command line
> interface with the -encoding parameter. For some reason this isn't being
> followed.
> I'll work on fixing this for the next major release... :-)