Re: [PR] OPENNLP-1846 - Recognize all entity types in NameFinderDL, not only p… (opennlp)

via GitHub Tue, 16 Jun 2026 05:28:30 -0700


Copilot commented on code in PR #1086:
URL: https://github.com/apache/opennlp/pull/1086#discussion_r3420671269



##########
opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/namefinder/NameFinderDL.java:
##########
@@ -151,239 +157,320 @@ public Span[] find(String[] input) {
 
     final String[] sentences = sentenceDetector.sentDetect(text);
 
+    // A monotonic cursor into the full text, threaded across every sentence 
and chunk so a
+    // repeated entity surface form is located at its next occurrence rather 
than re-matched at
+    // the first one (which would emit duplicate spans for 
multi-sentence/multi-chunk input).
+    int searchStart = 0;
+
     for (String sentence : sentences) {
 
       // The WordPiece tokenized text. This changes the spacing in the text.
       final List<Tokens> wordpieceTokens = tokenize(sentence);
 
       for (final Tokens tokens : wordpieceTokens) {
+        final List<Span> decoded =
+            decodeSpans(text, tokens.tokens(), infer(tokens), ids2Labels, 
searchStart);
+        spans.addAll(decoded);
+        if (!decoded.isEmpty()) {
+          searchStart = decoded.get(decoded.size() - 1).getEnd();
+        }
+      }
 
-        try {
-
-          // The inputs to the ONNX model.
-          final Map<String, OnnxTensor> inputs = new HashMap<>();
-
-          final float[][][] v;
-          try {
-            inputs.put(INPUT_IDS, OnnxTensor.createTensor(env, 
LongBuffer.wrap(tokens.ids()),
-                new long[] {1, tokens.ids().length}));
-
-            if (includeAttentionMask) {
-              inputs.put(ATTENTION_MASK, OnnxTensor.createTensor(env,
-                  LongBuffer.wrap(tokens.mask()), new long[] {1, 
tokens.mask().length}));
-            }
-
-            if (includeTokenTypeIds) {
-              inputs.put(TOKEN_TYPE_IDS, OnnxTensor.createTensor(env,
-                  LongBuffer.wrap(tokens.types()), new long[] {1, 
tokens.types().length}));
-            }
-
-            // The outputs from the model.
-            try (OrtSession.Result result = session.run(inputs)) {
-              // getValue() copies the tensor into Java arrays, so the result 
can be closed safely.
-              v = (float[][][]) result.get(0).getValue();
-            }
-          } finally {
-            inputs.values().forEach(OnnxTensor::close);
-          }
-
-          // Find consecutive B-PER and I-PER labels and combine the spans 
where necessary.
-          // There are also B-LOC and I-LOC tags for locations that might be 
useful at some point.
+    }
 
-          // Keep track of where the last span was so when there are 
multiple/duplicate
-          // spans we can get the next one instead of the first one each time.
-          int characterStart = 0;
+    return spans.toArray(new Span[0]);
 
-          final String[] toks = tokens.tokens();
+  }
 
-          // We are looping over the vector for each word,
-          // finding the index of the array that has the maximum value,
-          // and then finding the token classification that corresponds to 
that index.
-          for (int x = 0; x < v[0].length; x++) {
+  /**
+   * Runs the model on one token window and returns the per-token label score 
rows. A failure
+   * executing the model (an {@link OrtException} or any runtime fault) is 
surfaced as an
+   * {@link IllegalStateException} (cause preserved); an unexpected output 
shape is its own loud
+   * failure. This mirrors the fail-loud contract of the sibling {@code 
DocumentCategorizerDL}.
+   *
+   * @param tokens The tokens for one chunk to run inference on.
+   * @return The {@code [token][label]} score matrix for the chunk.
+   */
+  private float[][] infer(final Tokens tokens) {
 
-            final float[] arr = v[0][x];
-            final int maxIndex = maxIndex(arr);
-            final String label = ids2Labels.get(maxIndex);
+    final Map<String, OnnxTensor> inputs = new HashMap<>();
+    final Object output;
+    try {
+      inputs.put(INPUT_IDS, OnnxTensor.createTensor(env, 
LongBuffer.wrap(tokens.ids()),
+          new long[] {1, tokens.ids().length}));
 
-            // TODO: Need to make sure this value is between 0 and 1?
-            // Can we do thresholding without it between 0 and 1?
-            final double confidence = arr[maxIndex]; // / 10;
+      if (includeAttentionMask) {
+        inputs.put(ATTENTION_MASK, OnnxTensor.createTensor(env,
+            LongBuffer.wrap(tokens.mask()), new long[] {1, 
tokens.mask().length}));
+      }
 
-            // Is this is the start of a person entity.
-            if (B_PER.equals(label)) {
+      if (includeTokenTypeIds) {
+        inputs.put(TOKEN_TYPE_IDS, OnnxTensor.createTensor(env,
+            LongBuffer.wrap(tokens.types()), new long[] {1, 
tokens.types().length}));
+      }
 
-              String spanText;
+      // getValue() copies the tensor into Java arrays, so the result can be 
closed safely.
+      try (OrtSession.Result result = session.run(inputs)) {
+        output = result.get(0).getValue();
+      }
+    } catch (OrtException | RuntimeException ex) {
+      throw new IllegalStateException("Unable to perform name finder 
inference", ex);
+    } finally {
+      inputs.values().forEach(OnnxTensor::close);
+    }
 
-              // Find the end index of the span in the array (where the label 
is not I-PER).
-              final SpanEnd spanEnd = findSpanEnd(v, x, ids2Labels, toks);
+    // The model returns one score row per token, batched: 
float[batch][token][label]. Any other
+    // shape (or an empty batch) is a model-contract violation, surfaced on 
its own rather than as
+    // "inference failed".
+    if (output instanceof float[][][] v && v.length > 0) {
+      return v[0];
+    }
+    throw new IllegalStateException("Unexpected model output type: "
+        + (output == null ? "null" : output.getClass().getName()));
+  }
 
-              // If the end is -1 it means this is a single-span token.
-              // If the end is != -1 it means this is a multi-span token.
-              if (spanEnd.index() != -1) {
+  @Override
+  public void clearAdaptiveData() {
+    // No use in this implementation.
+  }
 
-                final StringBuilder sb = new StringBuilder();
+  /**
+   * Decodes spans beginning the character search at the start of {@code 
text}. Equivalent to
+   * {@link #decodeSpans(String, String[], float[][], Map, int)} with {@code 
searchStart == 0}.
+   *
+   * @param text The original text passed to the model.
+   * @param tokens The WordPiece tokens produced for the text.
+   * @param tokenLabelScores The per-token label scores returned by the model.
+   * @param id2Labels The mapping from model output indexes to BIO labels.
+   * @return The decoded spans.
+   */
+  static List<Span> decodeSpans(String text, String[] tokens, float[][] 
tokenLabelScores,
+                                Map<Integer, String> id2Labels) {
+    return decodeSpans(text, tokens, tokenLabelScores, id2Labels, 0);
+  }
 
-                // We have to concatenate the tokens.
-                // Add each token in the array and separate them with a space.
-                // We'll separate each with a single space because later we'll 
find the original span
-                // in the text and ignore spacing between individual tokens in 
findByRegex().
-                int end = spanEnd.index();
-                for (int i = x; i <= end; i++) {
+  /**
+   * Converts model token classifications into character spans in the original 
input text.
+   *
+   * <p>The ONNX model returns one score vector for each WordPiece token. This 
method applies
+   * BIO decoding, reconstructs WordPiece fragments, and then resolves the 
reconstructed text
+   * against the original sentence so that {@link 
Span#getCoveredText(CharSequence)} works with
+   * the caller's input.</p>
+   *
+   * @param text The original text passed to the model.
+   * @param tokens The WordPiece tokens produced for the text.
+   * @param tokenLabelScores The per-token label scores returned by the model.
+   * @param id2Labels The mapping from model output indexes to BIO labels.
+   * @param searchStart The character offset in {@code text} to begin locating 
spans from. Threading
+   *     a monotonic cursor across the chunks and sentences of a single {@link 
#find(String[])} call
+   *     keeps a repeated entity surface form from being emitted twice at the 
same first occurrence.
+   * @return The decoded spans.
+   */
+  static List<Span> decodeSpans(String text, String[] tokens, float[][] 
tokenLabelScores,
+                                Map<Integer, String> id2Labels, int 
searchStart) {
 
-                  // If the next token starts with ##, combine it with this 
token.
-                  if (toks[i + 1].startsWith(CHARS_TO_REPLACE)) {
+    if (tokens.length != tokenLabelScores.length) {
+      throw new IllegalArgumentException("The number of tokens must match the 
number of "
+          + "model output rows.");
+    }

Review Comment:
   The exception thrown on a token/score-row count mismatch omits the actual 
counts, which makes model/tokenizer mismatches harder to debug. Include both 
lengths in the error message.
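
A minimal sketch of what such a message could look like. The class and helper names (`TokenScoreCheck`, `requireSameLength`) are hypothetical and only illustrate the message format; in the PR the check sits inline in `decodeSpans`.

```java
public final class TokenScoreCheck {

  // Hypothetical helper, not part of NameFinderDL: it isolates the length check
  // so the message format can be shown on its own.
  static void requireSameLength(String[] tokens, float[][] tokenLabelScores) {
    if (tokens.length != tokenLabelScores.length) {
      // Both counts are reported so a tokenizer/model mismatch is visible at a glance.
      throw new IllegalArgumentException("The number of tokens (" + tokens.length
          + ") must match the number of model output rows ("
          + tokenLabelScores.length + ").");
    }
  }

  public static void main(String[] args) {
    try {
      // Three tokens against two score rows triggers the check.
      requireSameLength(new String[] {"[CLS]", "George", "[SEP]"}, new float[2][3]);
    } catch (IllegalArgumentException ex) {
      System.out.println(ex.getMessage());
    }
  }
}
```

The wording of the message is only a suggestion; the point is that both lengths appear in it.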



##########
opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/namefinder/NameFinderDL.java:
##########
@@ -151,239 +157,320 @@ public Span[] find(String[] input) {
 
     final String[] sentences = sentenceDetector.sentDetect(text);
 
+    // A monotonic cursor into the full text, threaded across every sentence 
and chunk so a
+    // repeated entity surface form is located at its next occurrence rather 
than re-matched at
+    // the first one (which would emit duplicate spans for 
multi-sentence/multi-chunk input).
+    int searchStart = 0;
+
     for (String sentence : sentences) {
 
       // The WordPiece tokenized text. This changes the spacing in the text.
       final List<Tokens> wordpieceTokens = tokenize(sentence);
 
       for (final Tokens tokens : wordpieceTokens) {
+        final List<Span> decoded =
+            decodeSpans(text, tokens.tokens(), infer(tokens), ids2Labels, 
searchStart);
+        spans.addAll(decoded);
+        if (!decoded.isEmpty()) {
+          searchStart = decoded.get(decoded.size() - 1).getEnd();
+        }

Review Comment:
   `searchStart` is only advanced when `decoded` is non-empty. If a 
sentence/chunk yields no spans, the next sentence/chunk will still search from 
an earlier offset and may match an earlier occurrence of the same surface form, 
producing incorrect character offsets. To keep the cursor truly monotonic 
across sentences/chunks, advance it to at least the end of the current sentence 
(and thread a per-sentence cursor across chunks).
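
One way the cursor handling could look, as a self-contained sketch rather than a drop-in patch. `MonotonicCursorSketch`, `ChunkDecoder`, and `toyDecode` are hypothetical stand-ins for `find`, `decodeSpans`, and the model. The sketch also assumes each detected sentence is a verbatim substring of the full text, which may not hold for every sentence detector.

```java
import java.util.ArrayList;
import java.util.List;

public final class MonotonicCursorSketch {

  /** Hypothetical stand-in for decodeSpans: returns {start, end} offsets at or after searchStart. */
  interface ChunkDecoder {
    List<int[]> decode(String text, String chunk, int searchStart);
  }

  static List<int[]> find(String text, String[] sentences, ChunkDecoder decoder) {
    final List<int[]> spans = new ArrayList<>();
    int searchStart = 0;
    for (String sentence : sentences) {
      // Assumption: the sentence occurs verbatim in text at or after the cursor.
      final int sentenceStart = text.indexOf(sentence, searchStart);
      if (sentenceStart != -1) {
        searchStart = sentenceStart;
      }
      // One chunk per sentence here; with several chunks the same cursor
      // would be threaded through each of them in order.
      final List<int[]> decoded = decoder.decode(text, sentence, searchStart);
      spans.addAll(decoded);
      if (!decoded.isEmpty()) {
        searchStart = Math.max(searchStart, decoded.get(decoded.size() - 1)[1]);
      }
      // Advance past the sentence even when nothing was decoded in it.
      if (sentenceStart != -1) {
        searchStart = Math.max(searchStart, sentenceStart + sentence.length());
      }
    }
    return spans;
  }

  /** Toy decoder: pretends the model only tags "Jordan" in the sentence containing "met". */
  static List<int[]> toyDecode(String text, String chunk, int searchStart) {
    final List<int[]> out = new ArrayList<>();
    if (chunk.contains("met")) {
      final int start = text.indexOf("Jordan", searchStart);
      if (start != -1) {
        out.add(new int[] {start, start + "Jordan".length()});
      }
    }
    return out;
  }

  public static void main(String[] args) {
    final String text = "Jordan is a country. I met Jordan yesterday.";
    final String[] sentences = {"Jordan is a country.", "I met Jordan yesterday."};
    for (int[] s : find(text, sentences, MonotonicCursorSketch::toyDecode)) {
      System.out.println(s[0] + "-" + s[1] + " " + text.substring(s[0], s[1]));
    }
  }
}
```

In this toy input the first sentence yields no spans. If the cursor were advanced only on non-empty results, the second sentence's "Jordan" would be searched from offset 0 and resolved to the occurrence in the first sentence.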



##########
opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/namefinder/NameFinderDL.java:
##########
@@ -151,239 +157,320 @@ public Span[] find(String[] input) {
 
     final String[] sentences = sentenceDetector.sentDetect(text);
 
+    // A monotonic cursor into the full text, threaded across every sentence 
and chunk so a
+    // repeated entity surface form is located at its next occurrence rather 
than re-matched at
+    // the first one (which would emit duplicate spans for 
multi-sentence/multi-chunk input).
+    int searchStart = 0;
+
     for (String sentence : sentences) {
 
       // The WordPiece tokenized text. This changes the spacing in the text.
       final List<Tokens> wordpieceTokens = tokenize(sentence);
 
       for (final Tokens tokens : wordpieceTokens) {
+        final List<Span> decoded =
+            decodeSpans(text, tokens.tokens(), infer(tokens), ids2Labels, 
searchStart);
+        spans.addAll(decoded);
+        if (!decoded.isEmpty()) {
+          searchStart = decoded.get(decoded.size() - 1).getEnd();
+        }
+      }
 
-        try {
-
-          // The inputs to the ONNX model.
-          final Map<String, OnnxTensor> inputs = new HashMap<>();
-
-          final float[][][] v;
-          try {
-            inputs.put(INPUT_IDS, OnnxTensor.createTensor(env, 
LongBuffer.wrap(tokens.ids()),
-                new long[] {1, tokens.ids().length}));
-
-            if (includeAttentionMask) {
-              inputs.put(ATTENTION_MASK, OnnxTensor.createTensor(env,
-                  LongBuffer.wrap(tokens.mask()), new long[] {1, 
tokens.mask().length}));
-            }
-
-            if (includeTokenTypeIds) {
-              inputs.put(TOKEN_TYPE_IDS, OnnxTensor.createTensor(env,
-                  LongBuffer.wrap(tokens.types()), new long[] {1, 
tokens.types().length}));
-            }
-
-            // The outputs from the model.
-            try (OrtSession.Result result = session.run(inputs)) {
-              // getValue() copies the tensor into Java arrays, so the result 
can be closed safely.
-              v = (float[][][]) result.get(0).getValue();
-            }
-          } finally {
-            inputs.values().forEach(OnnxTensor::close);
-          }
-
-          // Find consecutive B-PER and I-PER labels and combine the spans 
where necessary.
-          // There are also B-LOC and I-LOC tags for locations that might be 
useful at some point.
+    }
 
-          // Keep track of where the last span was so when there are 
multiple/duplicate
-          // spans we can get the next one instead of the first one each time.
-          int characterStart = 0;
+    return spans.toArray(new Span[0]);
 
-          final String[] toks = tokens.tokens();
+  }
 
-          // We are looping over the vector for each word,
-          // finding the index of the array that has the maximum value,
-          // and then finding the token classification that corresponds to 
that index.
-          for (int x = 0; x < v[0].length; x++) {
+  /**
+   * Runs the model on one token window and returns the per-token label score 
rows. A failure
+   * executing the model (an {@link OrtException} or any runtime fault) is 
surfaced as an
+   * {@link IllegalStateException} (cause preserved); an unexpected output 
shape is its own loud
+   * failure. This mirrors the fail-loud contract of the sibling {@code 
DocumentCategorizerDL}.
+   *
+   * @param tokens The tokens for one chunk to run inference on.
+   * @return The {@code [token][label]} score matrix for the chunk.
+   */
+  private float[][] infer(final Tokens tokens) {
 
-            final float[] arr = v[0][x];
-            final int maxIndex = maxIndex(arr);
-            final String label = ids2Labels.get(maxIndex);
+    final Map<String, OnnxTensor> inputs = new HashMap<>();
+    final Object output;
+    try {
+      inputs.put(INPUT_IDS, OnnxTensor.createTensor(env, 
LongBuffer.wrap(tokens.ids()),
+          new long[] {1, tokens.ids().length}));
 
-            // TODO: Need to make sure this value is between 0 and 1?
-            // Can we do thresholding without it between 0 and 1?
-            final double confidence = arr[maxIndex]; // / 10;
+      if (includeAttentionMask) {
+        inputs.put(ATTENTION_MASK, OnnxTensor.createTensor(env,
+            LongBuffer.wrap(tokens.mask()), new long[] {1, 
tokens.mask().length}));
+      }
 
-            // Is this is the start of a person entity.
-            if (B_PER.equals(label)) {
+      if (includeTokenTypeIds) {
+        inputs.put(TOKEN_TYPE_IDS, OnnxTensor.createTensor(env,
+            LongBuffer.wrap(tokens.types()), new long[] {1, 
tokens.types().length}));
+      }
 
-              String spanText;
+      // getValue() copies the tensor into Java arrays, so the result can be 
closed safely.
+      try (OrtSession.Result result = session.run(inputs)) {
+        output = result.get(0).getValue();
+      }
+    } catch (OrtException | RuntimeException ex) {
+      throw new IllegalStateException("Unable to perform name finder 
inference", ex);
+    } finally {
+      inputs.values().forEach(OnnxTensor::close);
+    }
 
-              // Find the end index of the span in the array (where the label 
is not I-PER).
-              final SpanEnd spanEnd = findSpanEnd(v, x, ids2Labels, toks);
+    // The model returns one score row per token, batched: 
float[batch][token][label]. Any other
+    // shape (or an empty batch) is a model-contract violation, surfaced on 
its own rather than as
+    // "inference failed".
+    if (output instanceof float[][][] v && v.length > 0) {
+      return v[0];
+    }
+    throw new IllegalStateException("Unexpected model output type: "
+        + (output == null ? "null" : output.getClass().getName()));
+  }
 
-              // If the end is -1 it means this is a single-span token.
-              // If the end is != -1 it means this is a multi-span token.
-              if (spanEnd.index() != -1) {
+  @Override
+  public void clearAdaptiveData() {
+    // No use in this implementation.
+  }
 
-                final StringBuilder sb = new StringBuilder();
+  /**
+   * Decodes spans beginning the character search at the start of {@code 
text}. Equivalent to
+   * {@link #decodeSpans(String, String[], float[][], Map, int)} with {@code 
searchStart == 0}.
+   *
+   * @param text The original text passed to the model.
+   * @param tokens The WordPiece tokens produced for the text.
+   * @param tokenLabelScores The per-token label scores returned by the model.
+   * @param id2Labels The mapping from model output indexes to BIO labels.
+   * @return The decoded spans.
+   */
+  static List<Span> decodeSpans(String text, String[] tokens, float[][] 
tokenLabelScores,
+                                Map<Integer, String> id2Labels) {
+    return decodeSpans(text, tokens, tokenLabelScores, id2Labels, 0);
+  }
 
-                // We have to concatenate the tokens.
-                // Add each token in the array and separate them with a space.
-                // We'll separate each with a single space because later we'll 
find the original span
-                // in the text and ignore spacing between individual tokens in 
findByRegex().
-                int end = spanEnd.index();
-                for (int i = x; i <= end; i++) {
+  /**
+   * Converts model token classifications into character spans in the original 
input text.
+   *
+   * <p>The ONNX model returns one score vector for each WordPiece token. This 
method applies
+   * BIO decoding, reconstructs WordPiece fragments, and then resolves the 
reconstructed text
+   * against the original sentence so that {@link 
Span#getCoveredText(CharSequence)} works with
+   * the caller's input.</p>
+   *
+   * @param text The original text passed to the model.
+   * @param tokens The WordPiece tokens produced for the text.
+   * @param tokenLabelScores The per-token label scores returned by the model.
+   * @param id2Labels The mapping from model output indexes to BIO labels.
+   * @param searchStart The character offset in {@code text} to begin locating 
spans from. Threading
+   *     a monotonic cursor across the chunks and sentences of a single {@link 
#find(String[])} call
+   *     keeps a repeated entity surface form from being emitted twice at the 
same first occurrence.
+   * @return The decoded spans.
+   */
+  static List<Span> decodeSpans(String text, String[] tokens, float[][] 
tokenLabelScores,
+                                Map<Integer, String> id2Labels, int 
searchStart) {
 
-                  // If the next token starts with ##, combine it with this 
token.
-                  if (toks[i + 1].startsWith(CHARS_TO_REPLACE)) {
+    if (tokens.length != tokenLabelScores.length) {
+      throw new IllegalArgumentException("The number of tokens must match the 
number of "
+          + "model output rows.");
+    }
 
-                    sb.append(toks[i]).append(toks[i + 
1].replace(CHARS_TO_REPLACE, ""));
+    final List<Span> spans = new LinkedList<>();
 
-                    // Append a space unless the next (next) token starts with 
##.
-                    if (!toks[i + 2].startsWith(CHARS_TO_REPLACE)) {
-                      sb.append(" ");
-                    }
+    int characterStart = searchStart;
 
-                    // Skip the next token since we just included it in this 
iteration.
-                    i++;
+    for (int x = 0; x < tokenLabelScores.length; x++) {
+      final LabelPrediction prediction = predictLabel(tokenLabelScores[x], 
id2Labels);
+      if (!isBeginLabel(prediction.label())) {
+        continue;
+      }
 
-                  } else {
+      final String entityType = 
prediction.label().substring(BEGIN_PREFIX.length());
+      final EntityPrediction entity = findEntityEnd(tokenLabelScores, x, 
id2Labels,
+          entityType, prediction.probability());
+      final String spanText = buildSpanText(tokens, x, entity.endIndex());
 
-                    sb.append(toks[i].replace(CHARS_TO_REPLACE, ""));
+      if (spanText.isBlank() || SEPARATOR.equals(spanText)) {
+        x = entity.endIndex();
+        continue;
+      }
 
-                    // Append a space unless the next token is a period.
-                    if (!".".equals(toks[i + 1])) {
-                      sb.append(" ");
-                    }
+      final SpanMatch match = findByRegex(text, spanText, characterStart);
+      if (match.start() != -1) {
+        spans.add(new Span(match.start(), match.end(), entityType, 
entity.probability()));
+        characterStart = match.end();
+      }
 
-                  }
+      x = entity.endIndex();
+    }
 
-                }
+    return spans;
 
-                // This is the text of the span. We use the whole original 
input text and not one
-                // of the splits. This gives us accurate character positions.
-                spanText = findByRegex(text, sb.toString().trim()).trim();
+  }
 
-              } else {
+  private static EntityPrediction findEntityEnd(float[][] tokenLabelScores, 
int startIndex,
+                                                Map<Integer, String> id2Labels,
+                                                String entityType,
+                                                double startProbability) {
 
-                // This is a single-token span so there is nothing else to do 
except grab the token.
-                spanText = toks[x];
+    final String insideLabel = INSIDE_PREFIX + entityType;
+    int endIndex = startIndex;
+    double probability = startProbability;
 
-              }
+    for (int x = startIndex + 1; x < tokenLabelScores.length; x++) {
+      final LabelPrediction prediction = predictLabel(tokenLabelScores[x], 
id2Labels);
+      if (!insideLabel.equals(prediction.label())) {
+        break;
+      }
+      endIndex = x;
+      probability = Math.min(probability, prediction.probability());
+    }
 
-              if (!SEPARATOR.equals(spanText)) {
+    return new EntityPrediction(endIndex, probability);
 
-                spanText = spanText.replace(CHARS_TO_REPLACE, "");
+  }
 
-                // This ignores other potential matches in the same sentence
-                // by only taking the first occurrence.
-                characterStart = text.indexOf(spanText, characterStart);
+  private static boolean isBeginLabel(String label) {
+    return label.startsWith(BEGIN_PREFIX) && label.length() > 
BEGIN_PREFIX.length();
+  }
 
-                // TODO: This check should not be needed because the span was 
found.
-                // If we aren't finding it now it's because there's a 
whitespace difference.
-                if (characterStart != -1) {
+  private static LabelPrediction predictLabel(float[] scores, Map<Integer, 
String> id2Labels) {
 
-                  final int characterEnd = characterStart + spanText.length();
+    final int labelIndex = maxIndex(scores);
+    final String label = id2Labels.get(labelIndex);
+    if (label == null) {
+      throw new IllegalArgumentException("No label is configured for model 
output index "
+          + labelIndex + ".");
+    }
 
-                  spans.add(new Span(characterStart, characterEnd, spanText, 
confidence));
+    return new LabelPrediction(label, labelProbability(scores, labelIndex));
 
-                  // OP-1: Only increment characterStart by one.
-                  characterStart++;
+  }
 
-                }
+  static double labelProbability(float[] scores, int labelIndex) {
 
-              }
+    int positiveInfinityCount = 0;
+    double max = Float.NEGATIVE_INFINITY;
 
-            }
+    for (float score : scores) {
+      if (score == Float.POSITIVE_INFINITY) {
+        positiveInfinityCount++;
+      } else if (!Float.isNaN(score) && score > max) {
+        max = score;
+      }
+    }
 
-          }
+    if (positiveInfinityCount > 0) {
+      return scores[labelIndex] == Float.POSITIVE_INFINITY ? 1d / 
positiveInfinityCount : 0d;
+    }
 
-        } catch (OrtException ex) {
-          throw new RuntimeException("Error performing namefinder inference: " 
+ ex.getMessage(), ex);
-        }
+    if (max == Float.NEGATIVE_INFINITY) {
+      return 1d / scores.length;
+    }
 
+    double denominator = 0;
+    for (float score : scores) {
+      if (!Float.isNaN(score)) {
+        denominator += Math.exp(score - max);
       }
-
     }
 
-    return spans.toArray(new Span[0]);
+    return Math.exp(scores[labelIndex] - max) / denominator;
 
   }
 
-  @Override
-  public void clearAdaptiveData() {
-    // No use in this implementation.
-  }
+  static String buildSpanText(String[] tokens, int startIndex, int endIndex) {
 
-  private SpanEnd findSpanEnd(float[][][] v, int startIndex, Map<Integer, 
String> id2Labels,
-                              String[] tokens) {
+    final StringBuilder span = new StringBuilder();
+    String previousToken = null;
 
-    // -1 means there is no follow-up token, so it is a single-token span.
-    int index = -1;
-    int characterEnd = 0;
-
-    // Starts at the span start in the vector.
-    // Looks at the next token to see if it is an I-PER.
-    // Go until the next token is something other than I-PER.
-    // When the next token is not I-PER, return the previous index.
-
-    for (int x = startIndex + 1; x < v[0].length; x++) {
-
-      // Get the next item.
-      final float[] arr = v[0][x];
-
-      // See if the next token has an I-PER label.
-      final String nextTokenClassification = id2Labels.get(maxIndex(arr));
+    for (int x = startIndex; x <= endIndex && x < tokens.length; x++) {
+      final String token = tokens[x];
+      if ("[CLS]".equals(token) || SEPARATOR.equals(token)) {
+        continue;
+      }
 
-      if (!I_PER.equals(nextTokenClassification)) {
-        index = x - 1;
-        break;
+      final boolean subword = token.startsWith(CHARS_TO_REPLACE);
+      final String surface = subword ? 
token.substring(CHARS_TO_REPLACE.length()) : token;
+      if (surface.isEmpty()) {
+        continue;
       }
 
+      if (span.length() > 0 && !subword && shouldInsertSpace(previousToken, 
surface)) {
+        span.append(' ');
+      }
+      span.append(surface);
+      previousToken = surface;
     }
 
-    // Find where the span ends based on the tokens.
-    for (int x = 1; x <= index && x < tokens.length; x++) {
-      characterEnd += tokens[x].length();
-    }
+    return span.toString();
+
+  }
 
-    // Account for the number of spaces (that is the number of tokens).
-    // (One space per token.)
-    characterEnd += index - 1;
+  private static boolean shouldInsertSpace(String previousToken, String token) 
{
+    return previousToken != null && !hasNoSpaceBefore(token) && 
!hasNoSpaceAfter(previousToken);
+  }
 
-    return new SpanEnd(index, characterEnd);
+  private static boolean hasNoSpaceBefore(String token) {
+    return switch (token) {
+      case ".", ",", ":", ";", "!", "?", ")", "]", "}", "%", "'", "-", "/" -> 
true;
+      default -> false;
+    };
+  }
 
+  private static boolean hasNoSpaceAfter(String token) {
+    return switch (token) {
+      case "(", "[", "{", "$", "'", "-", "/" -> true;
+      default -> false;
+    };
   }
 
-  private int maxIndex(float[] arr) {
+  static int maxIndex(float[] arr) {
 
     double max = Float.NEGATIVE_INFINITY;
     int index = -1;
 
     for (int x = 0; x < arr.length; x++) {
-      if (arr[x] > max) {
+      if (!Float.isNaN(arr[x]) && (index == -1 || arr[x] > max)) {
         index = x;
         max = arr[x];
       }
     }
 
+    if (index == -1) {
+      throw new IllegalArgumentException("Model output scores must contain at 
least one value.");
+    }

Review Comment:
   `maxIndex` now skips NaN values, but the exception message still says "at 
least one value". When the array is non-empty but every value is NaN, the real 
requirement is at least one non-NaN score (infinite scores are still accepted 
by the comparison). Updating the message will make such failures clearer.
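
A sketch of the adjusted message. The loop is copied from the hunk above; the class name `MaxIndexSketch` and the exact wording of the message, including the reported length, are only a suggestion.

```java
public final class MaxIndexSketch {

  // Same selection logic as the hunk under review; only the message differs.
  static int maxIndex(float[] arr) {
    double max = Float.NEGATIVE_INFINITY;
    int index = -1;

    for (int x = 0; x < arr.length; x++) {
      // NaN entries are skipped; the first non-NaN entry is always accepted.
      if (!Float.isNaN(arr[x]) && (index == -1 || arr[x] > max)) {
        index = x;
        max = arr[x];
      }
    }

    if (index == -1) {
      // Reached for an empty array and for an array that holds only NaN.
      throw new IllegalArgumentException(
          "Model output scores must contain at least one non-NaN value (length: "
              + arr.length + ").");
    }
    return index;
  }

  public static void main(String[] args) {
    System.out.println(maxIndex(new float[] {Float.NaN, 0.2f, 0.9f}));
    try {
      maxIndex(new float[] {Float.NaN, Float.NaN});
    } catch (IllegalArgumentException ex) {
      System.out.println(ex.getMessage());
    }
  }
}
```

Reporting the array length lets the reader tell the empty case apart from the all-NaN case without needing a second exception type.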



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

