Re: [PR] OPENNLP-1845 - Fix numerically unstable softmax in DocumentCategorizerDL (opennlp)

via GitHub Sat, 13 Jun 2026 16:34:08 -0700


Copilot commented on code in PR #1085:
URL: https://github.com/apache/opennlp/pull/1085#discussion_r3408756652



##########
opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/doccat/DocumentCategorizerDL.java:
##########
@@ -298,21 +302,20 @@ private List<Tokens> tokenize(final String text) {
     // Split the input text into 200 word chunks with 50 overlapping between chunks.
     final String[] whitespaceTokenized = text.split("\\s+");
 
-    for (int start = 0; start < whitespaceTokenized.length;
-         start = start + inferenceOptions.getDocumentSplitSize()) {
+    int start = 0;
+    while (start < whitespaceTokenized.length) {
 
-      // 200 word length chunk
-      // Check the end do don't go past and get a StringIndexOutOfBoundsException
-      int end = start + inferenceOptions.getDocumentSplitSize();
-      if (end > whitespaceTokenized.length) {
-        end = whitespaceTokenized.length;
-      }
+      // 200 word length chunk, clamped so we never read past the end.
+      final int end =
+          Math.min(start + inferenceOptions.getDocumentSplitSize(), whitespaceTokenized.length);
 
       // The group is that subsection of string.
       final String group = String.join(" ", Arrays.copyOfRange(whitespaceTokenized, start, end));
 
-      // We want to overlap each chunk by 50 words so scoot back 50 words for the next iteration.
-      start = start - inferenceOptions.getSplitOverlapSize();
+      // Advance to the next chunk, overlapping by the configured amount; stop once the end
+      // of the input has been consumed.
+      start = end == whitespaceTokenized.length
+          ? end : end - inferenceOptions.getSplitOverlapSize();

Review Comment:
   The chunk-advancement logic can hang or throw if `splitOverlapSize` is 
misconfigured: `splitOverlapSize == documentSplitSize` leaves `start` 
unchanged, so the loop never terminates, and `splitOverlapSize > 
documentSplitSize` moves `start` backwards until it becomes negative, at which 
point `Arrays.copyOfRange` throws `ArrayIndexOutOfBoundsException`. Since 
`InferenceOptions` setters don’t validate these values, ensure the loop always 
makes forward progress even for invalid overlap values.
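
   One possible way to guarantee forward progress is sketched below. It is 
an illustration only, not the PR's code: `ChunkingSketch` and `chunk` are 
hypothetical names, and clamping the overlap (rather than rejecting it in the 
`InferenceOptions` setters) is an assumption about the desired behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of OpenNLP: chunks tokens with a bounded overlap.
final class ChunkingSketch {

  static List<String> chunk(final String[] tokens, final int splitSize, final int overlapSize) {
    // Assumption: a non-positive split size is treated as 1 instead of failing.
    final int size = Math.max(1, splitSize);
    // Clamp the overlap to [0, size - 1] so every step advances by at least one token.
    final int overlap = Math.min(Math.max(0, overlapSize), size - 1);
    final int step = size - overlap;

    final List<String> chunks = new ArrayList<>();
    int start = 0;
    while (start < tokens.length) {
      // Clamp the end so we never read past the array.
      final int end = Math.min(start + size, tokens.length);
      chunks.add(String.join(" ", Arrays.copyOfRange(tokens, start, end)));
      if (end == tokens.length) {
        break; // the whole input has been consumed
      }
      start += step; // step >= 1, so start strictly increases
    }
    return chunks;
  }
}
```

   With valid settings this produces the same chunks as `start = end - 
overlap`; with an overlap equal to or larger than the split size it degrades 
to a one-token stride instead of looping forever or going negative. Validating 
the values where they are set would be an alternative design.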



##########
opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/doccat/DocumentCategorizerDL.java:
##########
@@ -198,7 +202,7 @@ public double[] categorize(String[] strings) {
       return classificationScoringStrategy.score(scores);
 
     } catch (Exception ex) {
-      logger.error("Unload to perform document classification inference", ex);
+      logger.error("Unable to perform document classification inference", ex);
     }
 
     return new double[] {};

Review Comment:
   Returning an empty score array on inference failure can cause 
`scoreMap`/`sortedScoreMap` to throw `ArrayIndexOutOfBoundsException` because 
they index `scores[x]` for every category key. Consider returning an array 
sized to `categories.size()` instead, so the rest of the `DocumentCategorizer` 
API remains safe to call even after an inference error (and update the method 
Javadoc accordingly).
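
   A minimal sketch of that suggestion follows. The names 
`SafeScoresSketch`, `scoresOrZeros`, and this `scoreMap` are hypothetical 
stand-ins, and the `Map<Integer, String>` category type is an assumption made 
for illustration; the actual fields in `DocumentCategorizerDL` may differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical illustration, not the OpenNLP implementation.
final class SafeScoresSketch {

  // Runs the inference; on failure (or a wrongly sized result) returns a
  // zero-filled array with one slot per category instead of an empty array.
  static double[] scoresOrZeros(final Supplier<double[]> inference, final int categoryCount) {
    try {
      final double[] scores = inference.get();
      if (scores != null && scores.length == categoryCount) {
        return scores;
      }
    } catch (RuntimeException ex) {
      // Real code would log the failure here, as the reviewed method does.
    }
    return new double[categoryCount]; // Java initializes every element to 0.0
  }

  // Stand-in for a method that indexes scores[x] for every category key.
  static Map<String, Double> scoreMap(final Map<Integer, String> categories, final double[] scores) {
    final Map<String, Double> result = new LinkedHashMap<>();
    for (final Map.Entry<Integer, String> entry : categories.entrySet()) {
      result.put(entry.getValue(), scores[entry.getKey()]);
    }
    return result;
  }
}
```

   One trade-off to keep in mind: callers can no longer distinguish a failed 
inference from a genuine all-zero result by array length alone, which is why 
the Javadoc would need to describe the fallback.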



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
