dpol1 opened a new pull request, #1928:
URL: https://github.com/apache/stormcrawler/pull/1928
## Summary
This PR addresses a set of code quality, correctness, and robustness issues
found across the `core` and `external` modules.
### Thread Safety
- Replace non-thread-safe `SimpleDateFormat` with `DateTimeFormatter` in
`FileResponse` and with `DateUtils.parseDate()` in `CookieConverter`
- Replace shared mutable `Matcher` in `RefreshTag` with a `Pattern` constant
and per-call `Matcher` instances
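
For reviewers, a minimal sketch of the two thread-safety patterns named above. This is not the actual diff: the class name, the date pattern, and the refresh regex are hypothetical and chosen only for illustration.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class; it does not mirror the real FileResponse or RefreshTag code.
final class ThreadSafetySketch {

    // DateTimeFormatter is immutable and thread-safe, so a shared constant is fine.
    // (SimpleDateFormat keeps mutable state and must not be shared across threads.)
    private static final DateTimeFormatter HTTP_DATE =
            DateTimeFormatter.ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.ENGLISH)
                    .withZone(ZoneOffset.UTC);

    // A compiled Pattern is thread-safe; a Matcher is not, so it is never stored in a field.
    // Assumed pattern for a meta refresh value such as "0; URL=http://example.com/".
    private static final Pattern REFRESH_URL =
            Pattern.compile("^\\s*\\d+\\s*;\\s*url\\s*=\\s*(.+)$", Pattern.CASE_INSENSITIVE);

    static String formatHttpDate(long epochMillis) {
        return HTTP_DATE.format(Instant.ofEpochMilli(epochMillis));
    }

    static String extractRefreshUrl(String content) {
        if (content == null) {
            return null;
        }
        // A fresh Matcher per call keeps all matching state local to this thread.
        Matcher matcher = REFRESH_URL.matcher(content);
        return matcher.matches() ? matcher.group(1).trim() : null;
    }
}
```

Both constants can be shared freely between bolt executor threads because neither holds per-call state.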
### Resource Leaks
- Use try-with-resources for `FileInputStream` in `FileResponse` and
`InputStream` in `RegexURLFilterBase`
- Add `cleanup()` to `DebugParseFilter` to close the output stream
- Ensure Playwright tracing is always stopped via a `finally` block in
`HttpProtocol`, even when an exception is thrown
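
The resource-handling changes follow two standard idioms, sketched below under stated assumptions. The `Tracing` interface is a hypothetical stand-in, not the Playwright API, and the method names are illustrative rather than taken from the patch.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.Callable;

// Hypothetical class illustrating the idioms only.
final class ResourceSketch {

    /** Hypothetical stand-in for a tracing session; NOT the Playwright API. */
    interface Tracing {
        void start();

        void stop();
    }

    // try-with-resources closes the stream on every exit path,
    // including when readAllBytes() throws.
    static String readAll(InputStream source) throws IOException {
        try (InputStream in = source) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    // The finally block guarantees stop() runs even if the work throws.
    static <T> T withTracing(Tracing tracing, Callable<T> work) throws Exception {
        tracing.start();
        try {
            return work.call();
        } finally {
            tracing.stop();
        }
    }
}
```

The same shape applies to any paired start/stop or open/close call: the release step lives where the language guarantees it will execute.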
### Error Handling & Robustness
- Wrap `Long.parseLong()` / `Integer.parseInt()` in `FetcherBolt` with
try-catch to handle invalid metadata values gracefully
- Add null check for classpath resource stream in `RegexURLFilterBase`
- Fix duplicate `setExpiryDate()` call and add null guard in
`CookieConverter`
- Fix misleading error message in `CloudSearchUtils` ("must be score" →
"must NOT be score")
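
A rough sketch of the defensive parsing and the classpath null check described above. The helper names and the choice to fall back to a default value are assumptions for illustration; the actual handling in `FetcherBolt` and `RegexURLFilterBase` may differ.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helpers; names and fallback behavior are assumptions.
final class RobustnessSketch {

    // Returns the fallback instead of letting NumberFormatException
    // propagate when a metadata value is missing or malformed.
    static long parseLongOrDefault(String raw, long fallback) {
        if (raw == null) {
            return fallback;
        }
        try {
            return Long.parseLong(raw.trim());
        } catch (NumberFormatException e) {
            return fallback;
        }
    }

    // getResourceAsStream() returns null (it does not throw) when the
    // resource is absent, so the null check must be explicit.
    static InputStream openResource(String name) throws IOException {
        InputStream in = RobustnessSketch.class.getClassLoader().getResourceAsStream(name);
        if (in == null) {
            throw new FileNotFoundException("Resource not found on classpath: " + name);
        }
        return in;
    }
}
```

Failing early with a descriptive exception avoids a later `NullPointerException` far from the cause.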
### Logging & Code Cleanup
- Replace all `e.printStackTrace()` calls with proper SLF4J `LOG.error()`
- Replace `URLEncoder.encode(url, "UTF-8")` (unnecessary checked exception)
with the `StandardCharsets.UTF_8` overload
- Replace manual `MessageDigest` boilerplate with `DigestUtils.sha512Hex()`
- Use actual document charset in `JsRenderingDetector` instead of hardcoded
UTF-8
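
To illustrate two of the cleanup items, here is a small sketch using only the JDK. `URLEncoder.encode(String, Charset)` is available since Java 10 and declares no checked exception. The `decodeBody` helper and its UTF-8 fallback are assumptions for illustration, not the code in `JsRenderingDetector`.

```java
import java.net.URLEncoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical class; method names are illustrative.
final class EncodingSketch {

    // The Charset overload cannot throw UnsupportedEncodingException,
    // so no try-catch is needed around the call.
    static String encode(String url) {
        return URLEncoder.encode(url, StandardCharsets.UTF_8);
    }

    // Decodes with the charset the document declares, falling back to
    // UTF-8 (an assumed policy) when the name is missing or unknown.
    static String decodeBody(byte[] content, String declaredCharset) {
        Charset charset = StandardCharsets.UTF_8;
        if (declaredCharset != null) {
            try {
                charset = Charset.forName(declaredCharset);
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: keep the UTF-8 fallback.
            }
        }
        return new String(content, charset);
    }
}
```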
### No behavior changes intended
All changes are backward-compatible. The crawl logic and output remain the
same.