[GitHub] [opennlp] mawiesne commented on a diff in pull request #427: OPENNLP-1366: Fix Training of MaxEnt Model with large corpora fails with java.io.UTFDataFormatException

GitBox Tue, 25 Oct 2022 23:26:56 -0700


mawiesne commented on code in PR #427:
URL: https://github.com/apache/opennlp/pull/427#discussion_r1005268338



##########
opennlp-tools/src/main/java/opennlp/tools/ml/model/ModelParameterChunker.java:
##########
@@ -0,0 +1,142 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.tools.ml.model;
+
+import java.io.DataInputStream;
+import java.io.DataOutputStream;
+import java.io.IOException;
+import java.io.UTFDataFormatException;
+import java.nio.ByteBuffer;
+import java.nio.CharBuffer;
+import java.nio.charset.CharsetEncoder;
+import java.nio.charset.CoderResult;
+import java.nio.charset.StandardCharsets;
+import java.util.ArrayList;
+import java.util.List;
+
+/**
+ * A helper class that handles Strings with more than 64k (65535 bytes) in length.
+ * This is achieved via the signature {@link #SIGNATURE_CHUNKED_PARAMS} at the beginning of
+ * the String instance to be written to a {@link DataOutputStream}.
+ * <p>
+ * Background: In OpenNLP, for large(r) corpora, we train models whose (UTF String) parameters will exceed
+ * the {@link #MAX_CHUNK_SIZE_BYTES} bytes limit set in {@link DataOutputStream}.
+ * For writing and reading those models, we have to chunk up those string instances in 64kB blocks and
+ * recombine them correctly upon reading a (binary) model file.
+ * <p>
+ * The problem was raised in <a href="https://issues.apache.org/jira/browse/OPENNLP-1366">ticket OPENNLP-1366</a>.
+ * <p>
+ * Solution strategy:
+ * <ul>
+ * <li>If writing parameters to a {@link DataOutputStream} blows up with a {@link UTFDataFormatException} a
+ * large String instance is chunked up and written as appropriate blocks.</li>
+ * <li>To indicate that chunking was conducted, we start with the {@link #SIGNATURE_CHUNKED_PARAMS} indicator,
+ * directly followed by the number of chunks used. This way, when reading in chunked model parameters,
+ * recombination is achieved transparently.</li>
+ * </ul>
+ * <p>
+ * Note: Both, existing (binary) model files and newly trained models which don't require the chunking
+ * technique, will be supported like in previous OpenNLP versions.
+ *
+ * @author <a href="mailto:[email protected]">Martin Wiesner</a>
+ * @author <a href="mailto:[email protected]">Mark Struberg</a>
+ */
+public final class ModelParameterChunker {
+
+  /*
+   * A signature that denotes the start of a String that required chunking.
+   *
+   * Semantics:
+   * If a model parameter (String) carries the below signature at the very beginning, this indicates
+   * that 'n > 1' chunks must be processed to obtain the whole model parameters. Otherwise, those would not be
+   * written to the binary model files (as reported in OPENNLP-1366) if the training occurs on large corpora
+   * as used, for instance, in the context of (very large) German NLP models.
+   */
+  public static final String SIGNATURE_CHUNKED_PARAMS = "CHUNKED-MODEL-PARAMS:"; // followed by no of chunks!
+
+  private static final int MAX_CHUNK_SIZE_BYTES = 65535; // the maximum 'utflen' DataOutputStream can handle
+
+  private ModelParameterChunker(){
+    // private utility class ctor
+  }
+
+  /**
+   * Reads model parameters from {@code dis}. In case the stream starts with {@link #SIGNATURE_CHUNKED_PARAMS},
+   * the number of chunks is detected and the original large parameter string is reconstructed from several
+   * chunks.
+   *
+   * @param dis   The stream which will be used to read the model parameter from.
+   */
+  public static String readUTF(DataInputStream dis) throws IOException {
+    String data = dis.readUTF();
+    if (data.startsWith(SIGNATURE_CHUNKED_PARAMS)) {
+      String chunkElements = data.replace(SIGNATURE_CHUNKED_PARAMS, "");
+      int chunkSize = Integer.parseInt(chunkElements);
+      StringBuilder sb = new StringBuilder();
+      for (int i = 0; i < chunkSize; i++) {
+        sb.append(dis.readUTF());
+      }
+      return sb.toString(); // the reconstructed model parameter string
+    } else {  // default case: no chunked data -> just return the read data / parameter information
+      return data;
+    }
+  }
+
+  /**
+   * Writes the model parameter {@code s} to {@code dos}. In case {@code s} does exceed
+   * {@link #MAX_CHUNK_SIZE_BYTES} in length, the chunking mechanism is used; otherwise the parameter is
+   * written 'as is'.
+   *
+   * @param dos   The {@link DataOutputStream} stream which will be used to persist the model.
+   * @param s     The input string that is checked for length and chunked if {@link #MAX_CHUNK_SIZE_BYTES} is
+   *              exceeded.
+   */
+  public static void writeUTF(DataOutputStream dos, String s) throws IOException {
+    try {
+      dos.writeUTF(s);
+    } catch (UTFDataFormatException dfe) {

Review Comment:
   @jzonthemtn In general, this looks entirely plausible. However, `DataOutputStream.writeUTF(...)` encodes characters in a (modified) UTF-8 format with its own rules, so checking the byte length of `s` up front is not sufficient or valid in all cases.
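To illustrate why a plain byte-length check can mislead, here is a minimal sketch. It is not code from this PR; the class `UtfLengthDemo` and its methods are hypothetical names. It assumes the encoding rules documented for `DataOutput.writeUTF`: one byte for U+0001..U+007F, two bytes for U+0000 and U+0080..U+07FF, three bytes for every other `char`, so a supplementary character (a surrogate pair) costs six bytes instead of the four that standard UTF-8 needs.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of OpenNLP.
class UtfLengthDemo {

  // Encoded length under the "modified UTF-8" rules documented for DataOutput.writeUTF.
  static int modifiedUtf8Length(String s) {
    int len = 0;
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (c >= 0x0001 && c <= 0x007F) {
        len += 1;
      } else if (c <= 0x07FF) {
        len += 2; // includes U+0000, which takes two bytes here
      } else {
        len += 3; // includes each half of a surrogate pair
      }
    }
    return len;
  }

  // Asks DataOutputStream itself whether the string fits into a single writeUTF call.
  static boolean fitsWriteUTF(String s) {
    try (DataOutputStream dos = new DataOutputStream(new ByteArrayOutputStream())) {
      dos.writeUTF(s);
      return true;
    } catch (UTFDataFormatException e) {
      return false; // encoded length exceeded 65535 bytes
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Builds a string of n supplementary characters (U+1F600, one surrogate pair each).
  static String emojis(int n) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < n; i++) {
      sb.append("\uD83D\uDE00");
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    String s = emojis(12000);
    System.out.println("standard UTF-8 bytes: " + s.getBytes(StandardCharsets.UTF_8).length);
    System.out.println("modified UTF-8 bytes: " + modifiedUtf8Length(s));
    System.out.println("fits writeUTF:        " + fitsWriteUTF(s));
  }
}
```

For this input, the standard UTF-8 length stays below 65535 bytes while the length `writeUTF` computes does not, so a check based on `getBytes(...)` would wrongly let the string through.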
   
   Thus, @struberg and I decided to leave it to the actual implementation of `DataOutputStream.writeUTF(...)` to tell us the truth. If you are interested, have a look at the code there for the "difficult" part. We decided not to copy and paste that code into OpenNLP, but to rely on the checks made in `DataOutputStream`.
   
   I hope you can follow our decision to avoid copy-and-paste in favor of catching `UTFDataFormatException` here once, in case the encoded length exceeds the 64k limit. This seems acceptable for those "rare" occasions when training with large corpora.
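As a rough illustration of this catch-and-fall-back idea, the following sketch writes a string directly and only chunks it when `writeUTF` rejects it. The quoted diff ends at the `catch` line, so the splitting logic below is an assumption and not the PR's actual implementation; the class `ChunkedUtfSketch` and its helper methods are hypothetical. The read side mirrors the `readUTF` method shown in the diff.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the class from the PR.
class ChunkedUtfSketch {
  static final String SIGNATURE = "CHUNKED-MODEL-PARAMS:"; // followed by the number of chunks
  static final int MAX_BYTES = 65535; // largest encoded length writeUTF accepts

  // Try the plain write first; fall back to chunking only if writeUTF rejects the string.
  static void writeUTF(DataOutputStream dos, String s) throws IOException {
    try {
      dos.writeUTF(s);
    } catch (UTFDataFormatException e) {
      List<String> chunks = split(s);
      dos.writeUTF(SIGNATURE + chunks.size());
      for (String chunk : chunks) {
        dos.writeUTF(chunk);
      }
    }
  }

  // Assumed splitting strategy: cut whenever the next char would push the
  // modified UTF-8 length of the current chunk past MAX_BYTES.
  static List<String> split(String s) {
    List<String> chunks = new ArrayList<>();
    int start = 0;
    int bytes = 0;
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      int n = (c >= 0x0001 && c <= 0x007F) ? 1 : (c <= 0x07FF ? 2 : 3);
      if (bytes + n > MAX_BYTES) {
        chunks.add(s.substring(start, i));
        start = i;
        bytes = 0;
      }
      bytes += n;
    }
    chunks.add(s.substring(start));
    return chunks;
  }

  // Read side: recombine the chunks if the signature is present.
  static String readUTF(DataInputStream dis) throws IOException {
    String data = dis.readUTF();
    if (!data.startsWith(SIGNATURE)) {
      return data;
    }
    int chunkCount = Integer.parseInt(data.substring(SIGNATURE.length()));
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < chunkCount; i++) {
      sb.append(dis.readUTF());
    }
    return sb.toString();
  }

  // Convenience helper for trying the sketch: write to memory, then read back.
  static String roundTrip(String s) {
    try {
      ByteArrayOutputStream bos = new ByteArrayOutputStream();
      try (DataOutputStream dos = new DataOutputStream(bos)) {
        writeUTF(dos, s);
      }
      try (DataInputStream dis = new DataInputStream(new ByteArrayInputStream(bos.toByteArray()))) {
        return readUTF(dis);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

The fallback relies on `writeUTF` computing the encoded length before it writes anything, so a rejected string leaves nothing half-written in the stream. Strings that fit take the ordinary path and pay no extra cost, which matches the argument that the exception is only hit on rare occasions.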
   
   As background: training on the 'Tueba Wikipedia' treebank runs into this exception only twice; other well-known German corpora, like the one from Hamburg that I also used for validation, hit it 3 or 4 times during training.
   
   



