Re: [PR] OPENNLP-1850: Offset-safe input normalization in the DL components (3/4) (opennlp)

via GitHub Sun, 21 Jun 2026 14:37:20 -0700


krickert commented on PR #1105:
URL: https://github.com/apache/opennlp/pull/1105#issuecomment-4763361792


   **opennlp-dl now compile-depends on all of opennlp-runtime.**
   Fixed at the root. The minimal normalization engine (`CharClass`, 
`CodePointSet`, `UnicodeWhitespace`, `UnicodeDash`) moved to `opennlp-api`, 
which already ships small concrete implementations (e.g. `BertTokenizer`, 
`WhitespaceTokenizer`). `opennlp-dl` now compiles against `opennlp-api` plus 
onnxruntime only; `opennlp-runtime` is back to a test-only dependency.
   
   **@Deprecated on find(String[]) is too broad.**
   Fixed. The `@Deprecated` is removed. `find`'s javadoc instead documents the 
single case it diverges: whitespace folding is length-preserving, so `find` is 
only affected when `normalizeDashes` is enabled and a non-BMP dash is present, 
in which case `findInOriginal` gives the exact original mapping. Every other 
caller is correct and no longer sees a warning. (This also resolves 
`NameFinderDLEval` calling a deprecated method.)
   
   **findInOriginal is only on the concrete type, not on TokenNameFinder.**
   Added it as a capability interface: `OffsetMappingNameFinder extends 
TokenNameFinder` in `opennlp-api` declares `findInOriginal`, and `NameFinderDL` 
implements it, so an interface-typed caller (UIMA, eval harness) reaches the 
offset-correct path with `finder instanceof OffsetMappingNameFinder` — a plain 
type check, no reflection. I went with a separate interface rather than a 
default method on `TokenNameFinder` for a concrete reason: 
`TokenNameFinder.find` is documented to return *token* spans (indices into the 
token array, as `NameFinderME` does), while `findInOriginal` returns 
*character* offsets in the original input, so a `default findInOriginal(t) { 
return find(t); }` would hand back token indices where the method promises 
character offsets. The only mismatch-free way to put it directly on 
`TokenNameFinder` is a default that throws `UnsupportedOperationException`, 
which callers have to guard against anyway, so it buys nothing over the 
capability check. (For full honesty, `N
 ameFinderDL.find` itself already returns character offsets rather than token 
spans, so the DL coordinate story is a pre-existing deviation from the 
interface contract.) It's your interface, so if you'd still rather it live on 
`TokenNameFinder` in some form, say the word and I'll switch it.
   
   **Missing end-to-end test.**
   Added a real one. `NameFinderDLEval.findInOriginalMapsSpansAcrossNonBmpDash` 
runs the actual `dslim/bert-base-NER` ONNX model with `normalizeDashes` enabled 
and a non-BMP dash (Yezidi hyphen, U+10EAD) before the entity, and asserts the 
`findInOriginal` span covers "George Washington" at its true original offset 
rather than the one-unit-shorter normalized offset. It exercises the full 
normalize/tokenize/infer/decode/map-back path with no production test seam, and 
it passes. The offset-mapping math is additionally unit-tested model-free in 
`AbstractDLChunkingTest`. This also moots the "`NameFinderDLEval` still calls 
deprecated `find()`" point, since `find()` is no longer deprecated.
   
   **Nit: README/javadoc say offset-preserving "for BMP".**
   Fixed. The README now states the stronger guarantee: whitespace folding 
never moves offsets, and `findInOriginal` maps spans back through the 
`Alignment` so they stay correct in the original input even for non-BMP dashes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] OPENNLP-1850: Offset-safe input normalization in the DL components (3/4) (opennlp)

Reply via email to