rzo1 commented on PR #1104:
URL: https://github.com/apache/opennlp/pull/1104#issuecomment-4779749205
Same resource-loading concern as in #1103, and here it shows up twice. Both
`ExtendedPictographic` and `WordBreakProperty` load their bundled `.txt` data
inside a `static {}` block:
```java
static {
try (InputStream in =
WordBreakProperty.class.getResourceAsStream(RESOURCE)) {
...
```
In an application container the classloader that loads these classes is
often not the loader the app runs with (servlet containers, OSGi,
shaded/relocated jars). When the class's loader can't see the resource, the
static block throws and the class is permanently poisoned:
`ExceptionInInitializerError` first, then `NoClassDefFoundError` with the
original cause gone on every later use. And because these two feed the UAX #29
`WordSegmenter`/`WordTokenizer`, a poisoned `WordBreakProperty` takes the whole
tokenizer down, not just one lookup.
Suggestion is the same as on the foundation PR: move the load behind a lazy,
recoverable accessor (double-checked `properties()` that calls `load()` on
first use) instead of an eager `static {}` block, so a failure throws a normal
catchable exception rather than killing the class for the lifetime of the
loader. As noted there, this is a general rule for any resource/model loading,
not specific to these two classes. (The `getResourceAsStream` in
`WordBoundaryConformanceTest` is test-only, so it's fine.)
### On review size
Like the foundation PR, this one is also a lot for a human to QM-review in
one pass. Of the +6,762 lines, 3,976 are bundled data (`WordBreakTest.txt`,
`WordBreakProperty.txt`, `ExtendedPictographic.txt`), so the real code is
~2,800 lines, but it still bundles three independent concepts that could land
as smaller stacked PRs:
- **2a, UAX #29 tokenizer.** `WordSegmenter`, `WordTokenizer`, `WordType`,
`WordToken`, `WordBreak`, `WordBreakProperty`, `ExtendedPictographic`, the
bundled data, and the conformance tests. Self-contained and the bulk of the
work.
- **2b, Term model.** `Term`, `TermAnalyzer` and their tests.
- **2c, NormalizationProfile registry.** `NormalizationProfile`,
`NormalizationProfiles` and their tests.
Smaller PRs over one large one give the reviewer a real chance to read each
layer rather than skim a 6.7k diff. If re-stacking is more churn than it's
worth, at minimum the description should point reviewers past the data files to
the ~2,800 lines that actually need eyes. Generally I'd favour the step-by-step
route: smaller, conceptually-scoped PRs over one big one, even when they have
to stack.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]