Sudheer Prem created OPENNLP-1218:
-------------------------------------
Summary: All binary implementations of AbstractModelWriter and
DataReader throw java.io.UTFDataFormatException when a large dataset is used
for training
Key: OPENNLP-1218
URL: https://issues.apache.org/jira/browse/OPENNLP-1218
Project: OpenNLP
Issue Type: Bug
Affects Versions: 1.8.4
Reporter: Sudheer Prem
All binary implementations of AbstractModelWriter and DataReader throw
"java.io.UTFDataFormatException: encoded string too long" from the
java.io.DataOutputStream.writeUTF method call when a large dataset is used
for training and a single encoded string exceeds 64 KB. This is a known
limitation of java.io.DataOutputStream.writeUTF: it prefixes the string with
an unsigned 16-bit length, so it cannot write more than 65,535 bytes.
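For reference, the limitation is easy to reproduce outside OpenNLP. A minimal
sketch (the class name WriteUtfLimitDemo is just for illustration):

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class WriteUtfLimitDemo {
    public static void main(String[] args) throws IOException {
        // Build a string whose UTF-8 encoding exceeds the 65,535-byte
        // limit imposed by writeUTF's unsigned 16-bit length prefix.
        StringBuilder sb = new StringBuilder(70000);
        for (int i = 0; i < 70000; i++) {
            sb.append('x');
        }
        try (DataOutputStream out =
                 new DataOutputStream(new ByteArrayOutputStream())) {
            out.writeUTF(sb.toString()); // throws UTFDataFormatException
        }
    }
}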
Following is the stack trace:
java.io.UTFDataFormatException: encoded string too long: 97519 bytes
    at java.io.DataOutputStream.writeUTF(DataOutputStream.java:364)
    at java.io.DataOutputStream.writeUTF(DataOutputStream.java:323)
    at opennlp.tools.ml.naivebayes.BinaryNaiveBayesModelWriter.writeUTF(BinaryNaiveBayesModelWriter.java:67)
    at opennlp.tools.ml.naivebayes.NaiveBayesModelWriter.persist(NaiveBayesModelWriter.java:169)
    at opennlp.tools.ml.model.GenericModelWriter.persist(GenericModelWriter.java:75)
    at opennlp.tools.util.model.ModelUtil.writeModel(ModelUtil.java:71)
    at opennlp.tools.util.model.GenericModelSerializer.serialize(GenericModelSerializer.java:36)
    at opennlp.tools.util.model.GenericModelSerializer.serialize(GenericModelSerializer.java:29)
    at opennlp.tools.util.model.BaseModel.serialize(BaseModel.java:597)
The implementation should write the string as a length-prefixed byte array
instead. The following fix resolves the issue:
public void writeUTF(String s) throws java.io.IOException {
    // Write the string as a length-prefixed UTF-8 byte array to avoid
    // the 65,535-byte limit of DataOutputStream.writeUTF.
    byte[] ctxByte = s.getBytes(java.nio.charset.StandardCharsets.UTF_8);
    output.writeInt(ctxByte.length);
    output.write(ctxByte);
    // output.writeUTF(s);  // old call, fails for strings over 64 KB
}
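The binary DataReader implementations need the matching change, since the new
format is no longer compatible with DataInputStream.readUTF. A sketch,
assuming the reader wraps its underlying stream in a field named input (the
field name is an assumption):

public String readUTF() throws java.io.IOException {
    // Read the length-prefixed UTF-8 byte array written by the fixed
    // writeUTF above; input is assumed to be a DataInputStream.
    int length = input.readInt();
    byte[] ctxByte = new byte[length];
    input.readFully(ctxByte);
    return new String(ctxByte, java.nio.charset.StandardCharsets.UTF_8);
    // return input.readUTF();  // old call, limited to 65,535 bytes
}

Note that models serialized with the old writeUTF-based format will not be
readable after this change unless some form of version handling is added.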