[ https://issues.apache.org/jira/browse/OPENNLP-367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13151118#comment-13151118 ]
Joern Kottmann commented on OPENNLP-367:
----------------------------------------
The transformed data from the formats package should be written to an output
file, for which the user can also specify the encoding. The command line
interface might behave slightly differently on different platforms, and may
also be confusing to use with data that cannot be encoded in the platform
default encoding.
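
As a rough illustration of the output-file idea, here is a minimal sketch in
Java. The class name, method name, and parameters are hypothetical and are not
taken from the OpenNLP code base; the sketch only assumes that the converted
samples are available as strings and that the user supplies a charset name.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper, not part of OpenNLP.
class EncodedOutputSketch {

    // Writes one converted sample per line to outFile, encoded with the
    // charset the user named, instead of printing to System.out (which
    // encodes with the platform default).
    static void writeSamples(List<String> samples, Path outFile, String encodingName)
            throws IOException {
        // Charset.forName throws an unchecked exception for unknown names.
        Charset charset = Charset.forName(encodingName);
        try (Writer out = Files.newBufferedWriter(outFile, charset)) {
            for (String sample : samples) {
                out.write(sample);
                out.write('\n');
            }
        }
    }
}
```

With this shape the bytes on disk depend only on the requested encoding, not
on the platform the tool runs on.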
The encoding in ConllXPOSSampleStream should either be hard-coded or passed in,
but we should not use the platform default.
Hard-coding the encoding in which the CoNLL-X data is distributed should be OK.
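
A minimal sketch of the two options (hard-coded or passed in) follows. The
class and method names are hypothetical and do not reflect the actual
ConllXPOSSampleStream API, and the choice of UTF-8 as the hard-coded value is
an assumption about how the CoNLL-X data is distributed.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not the real ConllXPOSSampleStream.
class ExplicitCharsetReaderSketch {

    // Assumption: the CoNLL-X data is distributed as UTF-8.
    static final Charset CONLLX_CHARSET = StandardCharsets.UTF_8;

    // Option 1: hard-coded encoding.
    static BufferedReader open(InputStream in) {
        return open(in, CONLLX_CHARSET);
    }

    // Option 2: encoding passed in by the caller.
    // Either way, the platform default charset is never consulted.
    static BufferedReader open(InputStream in, Charset charset) {
        return new BufferedReader(new InputStreamReader(in, charset));
    }

    // Small usage example: decode the first line of an in-memory sample.
    static String firstLine(byte[] data) throws IOException {
        try (BufferedReader reader = open(new ByteArrayInputStream(data))) {
            return reader.readLine();
        }
    }
}
```

Because the charset is always given explicitly, the result is the same whether
or not the JVM was started with -Dfile.encoding=UTF-8.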
> File Encoding Issues
> --------------------
>
> Key: OPENNLP-367
> URL: https://issues.apache.org/jira/browse/OPENNLP-367
> Project: OpenNLP
> Issue Type: Bug
> Components: Command Line Interface
> Affects Versions: tools-1.5.2-incubating
> Environment: All
> Reporter: James Kosin
> Assignee: James Kosin
> Labels: encoding, rework, training
> Attachments: encoding.patch
>
> Original Estimate: 672h
> Remaining Estimate: 672h
>
> The input and output encodings are not working correctly or are not properly
> handled. A good example is the CoNLL 2002 data: even when correctly encoded in
> UTF-8, it does not work for training unless -Dfile.encoding=UTF-8 is specified
> for the java command.
> We already specify the input and expected output encoding on the command line
> interface with the -encoding parameter. For some reason this isn't being
> followed.
> I'll work on fixing this for the next major release... :-)