krickert commented on code in PR #1086:
URL: https://github.com/apache/opennlp/pull/1086#discussion_r3409001486
##########
opennlp-core/opennlp-ml/opennlp-dl/src/main/java/opennlp/dl/namefinder/NameFinderDL.java:
##########
@@ -187,13 +193,16 @@ public Span[] find(String[] input) {
// Can we do thresholding without it between 0 and 1?
final double confidence = arr[maxIndex]; // / 10;
- // Is this is the start of a person entity.
- if (B_PER.equals(label)) {
+ // Is this the start of an entity? Any B-<TYPE> begins a span of
that type.
+ if (label != null && label.startsWith(BEGIN_PREFIX)) {
+
+ // The entity type is the label without its B- prefix, e.g.
B-ORG -> ORG.
+ final String entityType = label.substring(BEGIN_PREFIX.length());
Review Comment:
Good catch. Fixed in b829dd3f: `isBeginLabel()` now requires a non-empty
type after the `B-` prefix (`label.length() > BEGIN_PREFIX.length()`), so a
bare or malformed `B-` label no longer starts an empty-type span. Covered by
`NameFinderDLTest#testDecodeSpansIgnoresMalformedBeginLabels`.
##########
opennlp-eval-tests/src/test/java/opennlp/dl/namefinder/NameFinderDLEval.java:
##########
@@ -62,12 +62,23 @@ public void tokenNameFinder1Test() throws Exception {
logger.debug(span.toString());
}
- Assertions.assertEquals(1, spans.length);
+ final String text = String.join(" ", tokens);
+
+ // The model emits a PER and a LOC entity; the person-only decoder
previously dropped
+ // the location. Span types are the entity labels (PER/LOC), not the
matched text.
+ Assertions.assertEquals(2, spans.length);
+
+ Assertions.assertEquals("PER", spans[0].getType());
Assertions.assertEquals(0, spans[0].getStart());
Assertions.assertEquals(17, spans[0].getEnd());
Assertions.assertEquals(8.251646041870117, spans[0].getProb(), 0.00001);
- Assertions.assertEquals("George Washington",
- spans[0].getCoveredText(String.join(" ", tokens)));
+ Assertions.assertEquals("George Washington",
spans[0].getCoveredText(text));
+
+ Assertions.assertEquals("LOC", spans[1].getType());
+ Assertions.assertEquals(39, spans[1].getStart());
+ Assertions.assertEquals(52, spans[1].getEnd());
+ Assertions.assertEquals("United States", spans[1].getCoveredText(text));
+ Assertions.assertTrue(spans[1].getProb() > 0);
Review Comment:
Agreed — resolved at the source in b829dd3f. `Span.getProb()` is now a
numerically stable softmax over the token label scores (bounded to [0,1])
instead of the raw max logit, which clears the old TODO. The eval now asserts
bounded probabilities rather than an exact value, and `NameFinderDLTest` adds
model-free coverage for the [0,1] bound.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]