[ https://issues.apache.org/jira/browse/OPENNLP-1218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17627227#comment-17627227 ]
Martin Wiesner commented on OPENNLP-1218:
-----------------------------------------

This is an open duplicate of #OPENNLP-1366, which was resolved recently. The fix(es) will be released with OpenNLP version 2.0.1.

[~jzemerick]: Could you mark this one as resolved / duplicate with the above fix version?

I also applied the fix from #OPENNLP-1366 to _BinaryNaiveBayesModelWriter_, as reported by [~sudheerprem] for his case.

For details, see: [https://github.com/apache/opennlp/pull/427]

> All binary implementation of AbstractModelWriter and DataReader throws
> java.io.UTFDataFormatException when large dataset is used for training
> ----------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: OPENNLP-1218
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1218
>             Project: OpenNLP
>          Issue Type: Bug
>    Affects Versions: 1.8.4
>            Reporter: Sudheer Prem
>            Priority: Major
>
> All binary implementations of AbstractModelWriter and DataReader throw
> "java.io.UTFDataFormatException: encoded string too long" in the
> java.io.DataOutputStream.writeUTF method call when a large dataset (more than
> 64 KB) is used for training. This looks like a known limitation of the
> java.io.DataOutputStream.writeUTF method.
> Following is the stack trace:
> java.io.UTFDataFormatException: encoded string too long: 97519 bytes
>     at java.io.DataOutputStream.writeUTF(DataOutputStream.java:364)
>     at java.io.DataOutputStream.writeUTF(DataOutputStream.java:323)
>     at opennlp.tools.ml.naivebayes.BinaryNaiveBayesModelWriter.writeUTF(BinaryNaiveBayesModelWriter.java:67)
>     at opennlp.tools.ml.naivebayes.NaiveBayesModelWriter.persist(NaiveBayesModelWriter.java:169)
>     at opennlp.tools.ml.model.GenericModelWriter.persist(GenericModelWriter.java:75)
>     at opennlp.tools.util.model.ModelUtil.writeModel(ModelUtil.java:71)
>     at opennlp.tools.util.model.GenericModelSerializer.serialize(GenericModelSerializer.java:36)
>     at opennlp.tools.util.model.GenericModelSerializer.serialize(GenericModelSerializer.java:29)
>     at opennlp.tools.util.model.BaseModel.serialize(BaseModel.java:597)
>
> The implementation should use byte array to resolve this issue.
> Following is the fix to resolve this issue.
>
> public void writeUTF(String s) throws java.io.IOException {
>     byte[] ctxByte = s.getBytes("utf-8");
>     output.writeInt(ctxByte.length);
>     output.write(ctxByte);
>     //output.writeUTF(s);
> }

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
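
For illustration, here is a minimal, self-contained sketch of the length-prefixed approach proposed in the quoted report, together with the matching read side that a reader would need. The class and method names (LengthPrefixedStrings, writeString, readString, writeUtfRejects, roundTrip) are hypothetical and not part of OpenNLP, and this is not necessarily how the fix in the pull request above is implemented. Note that data written this way cannot be read back with DataInputStream.readUTF, so models persisted with writeUTF would not be compatible with this layout.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class for illustration only; not an OpenNLP API.
public class LengthPrefixedStrings {

    // Writes a 4-byte length followed by the raw UTF-8 bytes, which avoids
    // the 65535-byte limit on strings encoded by DataOutputStream.writeUTF.
    static void writeString(DataOutputStream out, String s) throws IOException {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    // Counterpart to writeString: reads the length, then exactly that many bytes.
    static String readString(DataInputStream in) throws IOException {
        int length = in.readInt();
        byte[] bytes = new byte[length];
        in.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Returns true if DataOutputStream.writeUTF refuses the given string.
    static boolean writeUtfRejects(String s) {
        try (DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream())) {
            out.writeUTF(s);
            return false;
        } catch (UTFDataFormatException e) {
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes the string to an in-memory buffer and reads it back.
    static String roundTrip(String s) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(buffer)) {
                writeString(out, s);
            }
            try (DataInputStream in =
                     new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
                return readString(in);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Same encoded size as the string in the reported stack trace.
        char[] chars = new char[97519];
        Arrays.fill(chars, 'x');
        String large = new String(chars);

        System.out.println("writeUTF rejects it: " + writeUtfRejects(large));
        System.out.println("round trip intact: " + large.equals(roundTrip(large)));
    }
}
```

The int length prefix allows strings up to the maximum Java array size, at the cost of two extra bytes per string compared with the unsigned 16-bit prefix used by writeUTF.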