[ https://issues.apache.org/jira/browse/OPENNLP-1218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Martin Wiesner resolved OPENNLP-1218.
-------------------------------------
Fix Version/s: 2.1.0
Assignee: Martin Wiesner
Resolution: Duplicate
> All binary implementations of AbstractModelWriter and DataReader throw
> java.io.UTFDataFormatException when a large dataset is used for training
> ----------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: OPENNLP-1218
> URL: https://issues.apache.org/jira/browse/OPENNLP-1218
> Project: OpenNLP
> Issue Type: Bug
> Affects Versions: 1.8.4
> Reporter: Sudheer Prem
> Assignee: Martin Wiesner
> Priority: Major
> Fix For: 2.1.0
>
>
> All binary implementations of AbstractModelWriter and DataReader throw
> "java.io.UTFDataFormatException: encoded string too long" from the
> java.io.DataOutputStream.writeUTF call when training on a large dataset
> produces a string whose encoded form exceeds 65535 bytes (64 KB). This is a
> known limitation of java.io.DataOutputStream.writeUTF, which stores the
> encoded length in a two-byte field.
> The stack trace follows:
> java.io.UTFDataFormatException: encoded string too long: 97519 bytes
>   at java.io.DataOutputStream.writeUTF(DataOutputStream.java:364)
>   at java.io.DataOutputStream.writeUTF(DataOutputStream.java:323)
>   at opennlp.tools.ml.naivebayes.BinaryNaiveBayesModelWriter.writeUTF(BinaryNaiveBayesModelWriter.java:67)
>   at opennlp.tools.ml.naivebayes.NaiveBayesModelWriter.persist(NaiveBayesModelWriter.java:169)
>   at opennlp.tools.ml.model.GenericModelWriter.persist(GenericModelWriter.java:75)
>   at opennlp.tools.util.model.ModelUtil.writeModel(ModelUtil.java:71)
>   at opennlp.tools.util.model.GenericModelSerializer.serialize(GenericModelSerializer.java:36)
>   at opennlp.tools.util.model.GenericModelSerializer.serialize(GenericModelSerializer.java:29)
>   at opennlp.tools.util.model.BaseModel.serialize(BaseModel.java:597)
>
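> The implementation should write the string as a length-prefixed byte array
> instead. The following change to writeUTF resolves the issue; the matching
> reader must then read an int length followed by that many UTF-8 bytes.
>
> public void writeUTF(String s) throws java.io.IOException {
>   // Encode explicitly as standard UTF-8 using the charset constant.
>   byte[] ctxByte = s.getBytes(java.nio.charset.StandardCharsets.UTF_8);
>   // A 4-byte length prefix replaces the 2-byte prefix used by
>   // DataOutputStream.writeUTF.
>   output.writeInt(ctxByte.length);
>   output.write(ctxByte);
> }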
>
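For illustration, the proposed fix and its reader-side counterpart could be sketched as below. This is a minimal standalone sketch, not OpenNLP code: the class name LengthPrefixedStrings and the methods writeString and readString are hypothetical, and the sketch assumes that reader and writer are changed together. Because the byte layout differs from what DataOutputStream.writeUTF produces (4-byte length and standard UTF-8 instead of 2-byte length and modified UTF-8), data written with one scheme would presumably not be readable with the other.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of OpenNLP.
public class LengthPrefixedStrings {

    // Writes a 4-byte length followed by the standard UTF-8 bytes of s.
    public static void writeString(DataOutputStream out, String s) throws IOException {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    // Reads a string written by writeString.
    public static String readString(DataInputStream in) throws IOException {
        int length = in.readInt();
        if (length < 0) {
            throw new IOException("negative string length: " + length);
        }
        byte[] bytes = new byte[length];
        in.readFully(bytes); // blocks until all bytes are read or throws EOFException
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Build a string as long as the one in the reported stack trace.
        char[] chars = new char[97519];
        Arrays.fill(chars, 'x');
        String big = new String(chars);

        // writeUTF rejects strings whose encoded form exceeds 65535 bytes.
        try {
            new DataOutputStream(new ByteArrayOutputStream()).writeUTF(big);
        } catch (UTFDataFormatException e) {
            System.out.println("writeUTF failed: " + e.getMessage());
        }

        // The length-prefixed variant round-trips the same string.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        writeString(new DataOutputStream(buffer), big);
        String copy = readString(
                new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
        System.out.println("round trip ok: " + big.equals(copy));
    }
}
```

The int prefix raises the per-string limit from 65535 bytes to the maximum Java array size, at the cost of two extra bytes per string.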