Re: Training of MaxEnt Model with large corpora fails with java.io.UTFDataFormatException

Zowalla, Richard Tue, 12 Apr 2022 07:40:55 -0700

Hi Jeff,

Thanks for the update.

We will try the change with a SNAPSHOT build that includes the
potential patch and start a run on the cluster with the Tübingen
Wikipedia treebank. We should have feedback regarding writeShort(...)
in about 48 hours.
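
For reference, here is a minimal sketch of the limit in question and one
possible way around it. It is not the patch from the branch: the class
LongStringIo and its helper methods are hypothetical names made up for
illustration, and the 4-byte length prefix is an assumption that would
change the on-disk format, so existing models would need separate
handling. writeUTF() stores the encoded length in two bytes and throws
UTFDataFormatException above 65535 bytes. writeShort() also keeps only
the low 16 bits of its argument, so a larger length wraps around instead
of failing.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration class, not part of OpenNLP.
class LongStringIo {

    // Writes a 4-byte length followed by the plain UTF-8 bytes (assumed format).
    static void writeLongString(DataOutputStream out, String s) throws IOException {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    // Reads a string that was written by writeLongString().
    static String readLongString(DataInputStream in) throws IOException {
        int length = in.readInt();
        byte[] bytes = new byte[length];
        in.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Returns true if DataOutputStream.writeUTF() rejects the string as too long.
    static boolean writeUtfFails(String s) throws IOException {
        try (DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream())) {
            out.writeUTF(s);
            return false;
        } catch (UTFDataFormatException e) {
            return true;
        }
    }

    // Writes the string to an in-memory buffer and reads it back.
    static String roundTrip(String s) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            writeLongString(out, s);
        }
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
            return readLongString(in);
        }
    }

    public static void main(String[] args) throws IOException {
        char[] chars = new char[70_000];          // 70000 bytes in UTF-8, above the 65535 limit
        Arrays.fill(chars, 'a');
        String big = new String(chars);

        System.out.println("writeUTF fails: " + writeUtfFails(big));
        System.out.println("int-prefixed round trip ok: " + roundTrip(big).equals(big));
        // A 16-bit length field cannot hold 70000; the value wraps around.
        System.out.println("length as short: " + (short) big.length());
    }
}
```

One more point to check when comparing files: writeUTF() uses modified
UTF-8, so its bytes differ from plain UTF-8 for the NUL character and
for supplementary characters.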

Regards
Richard

On Tuesday, 12.04.2022 at 08:09 -0500, Jeff Zemerick wrote:
> Luckily, this looks like a long-standing, well-known problem [1] with
> writeUTF(). According to other guidance and the method's javadocs [2],
> writeUTF() first writes the number of encoded bytes as a two-byte
> length and then the string itself, which caps the encoded string at
> 65535 bytes.
> Changing it to manually write the length of the string followed by
> write() allows the training to succeed. All unit tests pass, including
> the tests that load models from src/test/resources/, which suggests
> the change is backward compatible, but I want to verify that further
> to be sure.
> 
> Here are the changes:
> https://github.com/apache/opennlp/compare/master...jzonthemtn:OPENNLP-1366?expand=1
> 
> I am unsure about using writeShort() to write the length of the
> string. Even though it works for the UD data now, does it actually
> resolve the problem?
> 
> Anyone have any insights into this?
> 
> Thanks,
> Jeff
> 
> [1]
> https://stackoverflow.com/questions/22741556/dataoutputstream-purpose-of-the-encoded-string-too-long-restriction
> [2]
> https://docs.oracle.com/javase/7/docs/api/java/io/DataOutputStream.html#writeUTF(java.lang.String)
> 
> On Mon, Apr 11, 2022 at 1:41 PM Jeff Zemerick <jzemer...@apache.org>
> wrote:
> 
> > Great, thanks. I was able to reproduce the problem. I'll take a look
> > and keep this thread updated.
> > 
> > Thanks,
> > Jeff
> > 
> > On Mon, Apr 11, 2022 at 10:22 AM Zowalla, Richard <
> > richard.zowa...@hs-heilbronn.de> wrote:
> > 
> > > Hi Jeff,
> > > 
> > > > Thanks for the quick reply. Here it is:
> > > > https://issues.apache.org/jira/browse/OPENNLP-1366
> > > > 
> > > > Using the treebank from Tübingen might not be feasible, as it
> > > > consumes around 2 TB of RAM ;) - the link mentioned in the ticket
> > > > points to a smaller dataset, which should reproduce the issue
> > > > with a feasible amount of RAM.
> > > 
> > > It basically boils down to a size limitation in the JDK's
> > > DataOutputStream.
> > > 
> > > > Regards
> > > Richard
> > > 
> > > > On Monday, 11.04.2022 at 10:13 -0400, Jeff Zemerick wrote:
> > > > Hi Richard,
> > > > 
> > > > > Thanks for reporting this. A Jira issue with steps to
> > > > > reproduce it would be fantastic.
> > > > > https://issues.apache.org/jira/projects/OPENNLP
> > > > > 
> > > > > Please create one and reply back here with its ID once you do.
> > > > > I can take a look and see what can be done.
> > > > 
> > > > Thanks,
> > > > Jeff
> > > > 
> > > > On Mon, Apr 11, 2022 at 8:47 AM Zowalla, Richard <
> > > > richard.zowa...@hs-heilbronn.de> wrote:
> > > > 
> > > > > Hi all,
> > > > > 
> > > > > > We are working on training a large OpenNLP maxent model for
> > > > > > lemmatizing German texts. We use a Wikipedia treebank from
> > > > > > Tübingen.
> > > > > > 
> > > > > > This works fine for mid-size corpora (it just needs a little
> > > > > > bit of RAM and time). However, with larger corpora we are
> > > > > > running into the exception mentioned in [1]. Debugging into
> > > > > > the DataOutputStream reveals that this is a limitation of
> > > > > > java.io.DataOutputStream.
> > > > > > 
> > > > > > Do we have any chance to solve this, or do we need to
> > > > > > implement custom readers / writers in order to get it to
> > > > > > work?
> > > > > > 
> > > > > > If this is a general problem for large corpora, I am also
> > > > > > happy to create a related ticket / issue in Jira with steps
> > > > > > to reproduce ;)
> > > > > 
> > > > > Thanks in advance.
> > > > > 
> > > > > > Regards
> > > > > Richard
> > > > > 
> > > > > [1]
> > > > > 
> > > > > 
> > > https://stackoverflow.com/questions/70064477/opennlp-lemmatizertrainer-utfdataformatexception-encoded-string-too-long
> > > > > 
> > > > > --
> > > > > Richard Zowalla, M.Sc.
> > > > > Research Associate, PhD Student | Medical Informatics
> > > > > 
> > > > > Hochschule Heilbronn – University of Applied Sciences
> > > > > Max-Planck-Str. 39
> > > > > D-74081 Heilbronn
> > > > > > phone: +49 7131 504 6791 (currently not reachable by phone)
> > > > > mail: richard.zowa...@hs-heilbronn.de
> > > > > web: https://www.mi.hs-heilbronn.de/
> > > > > 
> > > --
> > > Richard Zowalla, M.Sc.
> > > Research Associate, PhD Student | Medical Informatics
> > > 
> > > Hochschule Heilbronn – University of Applied Sciences
> > > Max-Planck-Str. 39
> > > D-74081 Heilbronn
> > > phone: +49 7131 504 6791 (zur Zeit nicht via Telefon erreichbar)
> > > mail: richard.zowa...@hs-heilbronn.de
> > > web: https://www.mi.hs-heilbronn.de/
> > >
