[jira] [Commented] (OPENNLP-1218) All binary implementation of AbstractModelWriter and DataReader throws java.io.UTFDataFormatException when large dataset is used for training

Martin Wiesner (Jira) Tue, 01 Nov 2022 09:10:40 -0700


    [ 
https://issues.apache.org/jira/browse/OPENNLP-1218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17627227#comment-17627227
 ]


Martin Wiesner commented on OPENNLP-1218:
-----------------------------------------

This issue is still open, but it duplicates #OPENNLP-1366, which was resolved 
recently. The fix will be released with OpenNLP version 2.0.1.

[~jzemerick]: Could you mark this one as resolved / duplicate and set the fix 
version mentioned above?
I also applied the fix from #OPENNLP-1366 to _BinaryNaiveBayesModelWriter_, the 
class [~sudheerprem] reported for his case.

For details, see: [https://github.com/apache/opennlp/pull/427]

> All binary implementation of AbstractModelWriter and DataReader  throws 
> java.io.UTFDataFormatException when large dataset is used for training
> ----------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: OPENNLP-1218
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1218
>             Project: OpenNLP
>          Issue Type: Bug
>    Affects Versions: 1.8.4
>            Reporter: Sudheer Prem
>            Priority: Major
>
> All binary implementations of AbstractModelWriter and DataReader throw 
> "java.io.UTFDataFormatException: encoded string too long" in the 
> java.io.DataOutputStream.writeUTF method call when a large dataset (more than 
> 64 KB) is used for training. This is a known limitation of 
> java.io.DataOutputStream.writeUTF: it cannot write a string whose encoded 
> form is longer than 65535 bytes.
> The stack trace follows:
> java.io.UTFDataFormatException: encoded string too long: 97519 bytes
>  at java.io.DataOutputStream.writeUTF(DataOutputStream.java:364)
>  at java.io.DataOutputStream.writeUTF(DataOutputStream.java:323)
>  at opennlp.tools.ml.naivebayes.BinaryNaiveBayesModelWriter.writeUTF(BinaryNaiveBayesModelWriter.java:67)
>  at opennlp.tools.ml.naivebayes.NaiveBayesModelWriter.persist(NaiveBayesModelWriter.java:169)
>  at opennlp.tools.ml.model.GenericModelWriter.persist(GenericModelWriter.java:75)
>  at opennlp.tools.util.model.ModelUtil.writeModel(ModelUtil.java:71)
>  at opennlp.tools.util.model.GenericModelSerializer.serialize(GenericModelSerializer.java:36)
>  at opennlp.tools.util.model.GenericModelSerializer.serialize(GenericModelSerializer.java:29)
>  at opennlp.tools.util.model.BaseModel.serialize(BaseModel.java:597)
>  
> The implementation should write the string as a length-prefixed byte array 
> instead. The following change resolves the issue:
>  
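> public void writeUTF(String s) throws java.io.IOException {
>   // Write a 4-byte length followed by the UTF-8 bytes; unlike
>   // DataOutputStream.writeUTF, this is not capped at 65535 bytes.
>   byte[] ctxByte = s.getBytes(java.nio.charset.StandardCharsets.UTF_8);
>   output.writeInt(ctxByte.length);
>   output.write(ctxByte);
> }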
>  
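
For illustration only, here is a minimal, self-contained sketch of the 
length-prefixed approach proposed in the issue, together with the matching 
read side that a reader would need. The class name LengthPrefixedStrings and 
the helpers writeString, readString and writeUtfFails are hypothetical and not 
part of OpenNLP; this is not necessarily what the fix in pull request 427 
does.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class for illustration; not part of OpenNLP.
public class LengthPrefixedStrings {

    // Writes a 4-byte length followed by the string's UTF-8 bytes.
    public static void writeString(DataOutputStream out, String s) throws IOException {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    // Reads back a string that was written by writeString.
    public static String readString(DataInputStream in) throws IOException {
        int length = in.readInt();
        if (length < 0) {
            throw new IOException("Negative string length: " + length);
        }
        byte[] bytes = new byte[length];
        in.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Returns true if DataOutputStream.writeUTF rejects the given string.
    public static boolean writeUtfFails(String s) throws IOException {
        try (DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream())) {
            out.writeUTF(s);
            return false;
        } catch (UTFDataFormatException e) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a string of 97519 ASCII characters, the size from the stack trace.
        char[] chars = new char[97519];
        Arrays.fill(chars, 'a');
        String big = new String(chars);

        System.out.println("writeUTF fails: " + writeUtfFails(big));

        // Round-trip the same string through the length-prefixed format.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            writeString(out, big);
        }
        try (DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(buffer.toByteArray()))) {
            System.out.println("round trip ok: " + big.equals(readString(in)));
        }
    }
}
```

Note that this format is not compatible with what writeUTF produces 
(a 2-byte length followed by modified UTF-8), so data written one way cannot 
be read the other way; writer and reader have to change together.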



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
