Richard Zowalla created OPENNLP-1366:
----------------------------------------

             Summary: Training of MaxEnt Model with large corpora fails with java.io.UTFDataFormatException
                 Key: OPENNLP-1366
                 URL: https://issues.apache.org/jira/browse/OPENNLP-1366
             Project: OpenNLP
          Issue Type: Improvement
          Components: Lemmatizer
    Affects Versions: 1.9.4
            Reporter: Richard Zowalla


As written on 
[dev@opennlp.a.o|https://lists.apache.org/thread/vc5lfzj81tco703noqxpvy8sfj8fw8b1],
we are working on training a large OpenNLP MaxEnt model for lemmatizing
German texts. We use a Wikipedia treebank from Tübingen.

Training consumes more than 2 TB of RAM but does finish after some time. 
However, writing the model then fails with a java.io.UTFDataFormatException.

Training such a big model isn't feasible for debugging, though. Fortunately, a 
similar case with a smaller dataset is described on Stack Overflow: 
[https://stackoverflow.com/questions/70064477/opennlp-lemmatizertrainer-utfdataformatexception-encoded-string-too-long]
 

It contains the OpenNLP CLI command to train a lemmatizer on a much smaller 
dataset.

The exception is raised while writing a String as UTF in DataOutputStream, 
which has a hard-coded size limit in the JDK (for reasons beyond my 
knowledge ;))

Stacktrace:
{code:java}
java.io.UTFDataFormatException: encoded string too long: 383769 bytes
        at java.base/java.io.DataOutputStream.writeUTF(DataOutputStream.java:364)
        at java.base/java.io.DataOutputStream.writeUTF(DataOutputStream.java:323)
        at opennlp.tools.ml.maxent.io.BinaryGISModelWriter.writeUTF(BinaryGISModelWriter.java:71)
        at opennlp.tools.ml.maxent.io.GISModelWriter.persist(GISModelWriter.java:97)
        at opennlp.tools.ml.model.GenericModelWriter.persist(GenericModelWriter.java:75)
        at opennlp.tools.util.model.ModelUtil.writeModel(ModelUtil.java:71)
        at opennlp.tools.util.model.GenericModelSerializer.serialize(GenericModelSerializer.java:36)
        at opennlp.tools.util.model.GenericModelSerializer.serialize(GenericModelSerializer.java:29)
        at opennlp.tools.util.model.BaseModel.serialize(BaseModel.java:597)
        at opennlp.tools.cmdline.CmdLineUtil.writeModel(CmdLineUtil.java:182)
        at opennlp.tools.cmdline.lemmatizer.LemmatizerTrainerTool.run(LemmatizerTrainerTool.java:77)
        at opennlp.tools.cmdline.CLI.main(CLI.java:256)
{code}
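
For reference, the limit can be reproduced without OpenNLP: DataOutputStream.writeUTF stores the encoded length in a two-byte prefix, so any string whose encoded form exceeds 65535 bytes is rejected. The following is only a minimal sketch of that behaviour plus one possible way around it. The class WriteUtfLimitDemo and the helpers writeLongString / readLongString are hypothetical names and are not part of OpenNLP; the int-length-prefixed format is an assumption for illustration, not the format OpenNLP uses.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of OpenNLP.
class WriteUtfLimitDemo {

    // writeUTF stores the encoded length as an unsigned 16-bit value.
    static final int MAX_WRITE_UTF_BYTES = 65535;

    /** Builds an ASCII string of the given length (one encoded byte per char). */
    static String asciiOfLength(int length) {
        char[] chars = new char[length];
        Arrays.fill(chars, 'a');
        return new String(chars);
    }

    /** Returns true if DataOutputStream.writeUTF accepts the string. */
    static boolean writeUtfSucceeds(String s) {
        try (DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream())) {
            out.writeUTF(s);
            return true;
        } catch (UTFDataFormatException e) {
            return false; // "encoded string too long"
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Assumed alternative format: 4-byte length prefix followed by UTF-8 bytes. */
    static void writeLongString(DataOutputStream out, String s) throws IOException {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    /** Reads a string written by writeLongString. */
    static String readLongString(DataInputStream in) throws IOException {
        byte[] bytes = new byte[in.readInt()];
        in.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Writes and re-reads a string using the int-prefixed format, in memory. */
    static String roundTrip(String s) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(buffer)) {
                writeLongString(out, s);
            }
            try (DataInputStream in =
                     new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
                return readLongString(in);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String atLimit = asciiOfLength(MAX_WRITE_UTF_BYTES);
        String tooLong = asciiOfLength(383769); // size reported in the stack trace
        System.out.println(writeUtfSucceeds(atLimit));
        System.out.println(writeUtfSucceeds(tooLong));
        System.out.println(roundTrip(tooLong).equals(tooLong));
    }
}
```

Note that a different on-disk encoding like this would presumably also require a matching change on the reading side and would not be readable by existing model readers, so it is meant as a starting point for discussion rather than a drop-in fix.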



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
