Luckily, this appears to be a common, long-standing problem [1] with writeUTF(). According to other guidance and the method's Javadoc [2], writeUTF() writes a two-byte length (the number of encoded bytes) followed by the encoded string, so it cannot handle strings whose encoded form is longer than 65535 bytes. Changing the code to manually write the length of the string and then the bytes with write() allows the training to succeed. All unit tests pass, which suggests the change is backward compatible, since some unit tests load models from src/test/resources/, but I want to verify that further to be sure.
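To make the limit and the workaround concrete, here is a minimal sketch. It is not the code from the linked branch: the class name LengthPrefixedStrings and its methods are hypothetical, and the 4-byte length prefix is an assumption chosen for illustration. It shows writeUTF() rejecting a long string and a manual length-prefixed write/read round-tripping the same string.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of OpenNLP.
public class LengthPrefixedStrings {

    // Assumed format: a 4-byte length followed by the standard UTF-8 bytes.
    static void writeString(DataOutputStream out, String s) throws IOException {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    // Reads back a string written by writeString().
    static String readString(DataInputStream in) throws IOException {
        int length = in.readInt();
        byte[] bytes = new byte[length];
        in.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Writes the string to an in-memory buffer and reads it back.
    static String roundTrip(String s) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(buffer)) {
                writeString(out, s);
            }
            try (DataInputStream in = new DataInputStream(
                    new ByteArrayInputStream(buffer.toByteArray()))) {
                return readString(in);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns true if writeUTF() refuses the string as too long.
    static boolean writeUtfRejects(String s) {
        try (DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream())) {
            out.writeUTF(s);
            return false;
        } catch (UTFDataFormatException e) {
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // 70,000 ASCII characters encode to 70,000 bytes, above the 65535-byte limit.
        char[] chars = new char[70_000];
        Arrays.fill(chars, 'a');
        String big = new String(chars);

        System.out.println("writeUTF rejects long string: " + writeUtfRejects(big));
        System.out.println("length-prefixed round trip ok: " + roundTrip(big).equals(big));
    }
}
```

Two caveats on this sketch. A 4-byte prefix changes the on-disk layout, so it would not by itself read models written with writeUTF(); some form of format versioning would likely be needed. A 2-byte prefix written with writeShort() keeps the layout closer to what writeUTF() produces, but writeShort() only stores the low 16 bits of its argument, so it does not lift the size limit.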
Here are the changes:
https://github.com/apache/opennlp/compare/master...jzonthemtn:OPENNLP-1366?expand=1

I am unsure about using writeShort() to write the length of the string. Even though it works for the UD data now, does it actually resolve the problem? Does anyone have any insights into this?

Thanks,
Jeff

[1] https://stackoverflow.com/questions/22741556/dataoutputstream-purpose-of-the-encoded-string-too-long-restriction
[2] https://docs.oracle.com/javase/7/docs/api/java/io/DataOutputStream.html#writeUTF(java.lang.String)

On Mon, Apr 11, 2022 at 1:41 PM Jeff Zemerick <jzemer...@apache.org> wrote:

> Great, thanks. I was able to reproduce the problem. I'll take a look and
> keep this thread updated.
>
> Thanks,
> Jeff
>
> On Mon, Apr 11, 2022 at 10:22 AM Zowalla, Richard <
> richard.zowa...@hs-heilbronn.de> wrote:
>
>> Hi Jeff,
>>
>> thanks for the quick reply. Here it is:
>> https://issues.apache.org/jira/browse/OPENNLP-1366
>>
>> Using the treebank from Tübingen might not be feasible as it consumes
>> around 2 TB RAM ;) - the link mentioned in the ticket points to a
>> smaller dataset, which should reproduce the issue with a feasible
>> amount of required RAM.
>>
>> It basically boils down to a size limitation in the JDK's
>> DataOutputStream.
>>
>> Regards
>> Richard
>>
>> On Monday, 11.04.2022 at 10:13 -0400, Jeff Zemerick wrote:
>> > Hi Richard,
>> >
>> > Thanks for reporting this. A Jira issue with steps to reproduce it
>> > would be fantastic. https://issues.apache.org/jira/projects/OPENNLP
>> >
>> > Please create one and reply back here with its ID once you do. I can
>> > take a look and see what can be done.
>> >
>> > Thanks,
>> > Jeff
>> >
>> > On Mon, Apr 11, 2022 at 8:47 AM Zowalla, Richard <
>> > richard.zowa...@hs-heilbronn.de> wrote:
>> >
>> > > Hi all,
>> > >
>> > > we are working on training a large opennlp maxent model for
>> > > lemmatizing German texts. We use a wikipedia tree bank from
>> > > Tübingen.
>> > >
>> > > This works fine for mid-size corpora (it just needs a little bit
>> > > of RAM and time). However, we are running into the exception
>> > > mentioned in [1]. Debugging into the DataOutputStream reveals
>> > > that this is a limitation of java.io.DataOutputStream.
>> > >
>> > > Do we have any chance to solve this or do we need to implement
>> > > custom readers / writers in order to get it to work?
>> > >
>> > > If this is a general problem for large corpora, I am also happy to
>> > > create a related ticket / issue in Jira with steps to reproduce ;)
>> > >
>> > > Thanks in advance.
>> > >
>> > > Regards
>> > > Richard
>> > >
>> > > [1]
>> > > https://stackoverflow.com/questions/70064477/opennlp-lemmatizertrainer-utfdataformatexception-encoded-string-too-long
>> > >
>> > > --
>> > > Richard Zowalla, M.Sc.
>> > > Research Associate, PhD Student | Medical Informatics
>> > >
>> > > Hochschule Heilbronn – University of Applied Sciences
>> > > Max-Planck-Str. 39
>> > > D-74081 Heilbronn
>> > > phone: +49 7131 504 6791 (currently not reachable by phone)
>> > > mail: richard.zowa...@hs-heilbronn.de
>> > > web: https://www.mi.hs-heilbronn.de/