rzo1 commented on PR #1084: URL: https://github.com/apache/opennlp/pull/1084#issuecomment-4702467577
Thanks for the PR! A few things I'd like to resolve before merge: **1. `tokenize()` was rewritten in `DocumentCategorizerDL` but not in `NameFinderDL` — they now produce different chunkings.** The doccat loop became a `while` with `start = end == length ? end : end - splitOverlapSize`, while NameFinderDL keeps the original `for`-loop. This isn't just stylistic — it changes the tail behavior. For `len=10, docSplit=8, overlap=4`: - new doccat → chunks `[0,8)`, `[4,10)` - old namefinder → chunks `[0,8)`, `[4,10)`, `[8,10)` (extra redundant trailing chunk) So this PR silently changes `DocumentCategorizerDL`'s output (one fewer chunk → different averaged scores on some inputs) while leaving `NameFinderDL` on the old behavior. Could we either apply the same rewrite to `NameFinderDL` (preferably extracting the shared chunking into one helper in `AbstractDL`), or revert doccat to a pure field-rename so behavior is untouched here? Either way the two siblings shouldn't disagree. **2. Comments drifted with the above.** doccat now says `// We want to overlap each chunk by 50 words for the next iteration.` while namefinder still says `// ...so scoot back 50 words...`, which no longer matches its own code. Both also still carry the hardcoded "Split ... into 200 word chunks with 50 overlapping" comment even though the values come from `InferenceOptions`. **3. API-surface changes — and is `static` needed here?** The thread-safety goal is met entirely by the final fields + safe publication; converting `loadVocab`/`createTokenizer` to `static` is an independent cleanup. The downside is that `loadVocab` goes `public` → `public static`, which is source- and binary-incompatible for any external caller. Since `static` isn't required for this PR's purpose, could we either keep these as instance methods (no break), or, if you'd prefer to keep them static, call the change out in the release notes? Same applies to the removal of `AbstractDL`'s implicit no-arg constructor (in-repo subclasses/tests are updated, so it's internally consistent — just flagging it since `AbstractDL` is `public abstract`). **Minor:** in `close()`, resetting `closed` to `false` after a failed `session.close()` allows a retry that could hit "already closed" in the native layer; treating close as terminal would be a touch safer. Happy to look again once 1 & 2 are reconciled. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
