dpol1 opened a new pull request, #1928:
URL: https://github.com/apache/stormcrawler/pull/1928

   ## Summary  
     
   This PR addresses a set of code quality, correctness, and robustness issues 
found across the `core` and `external` modules.  
     
   ### Thread Safety  
   - Replace non-thread-safe `SimpleDateFormat` with `DateTimeFormatter` in 
`FileResponse` and with `DateUtils.parseDate()` in `CookieConverter`  
   - Replace shared mutable `Matcher` in `RefreshTag` with a `Pattern` constant 
and per-call `Matcher` instances  
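
As a rough illustration of the two thread-safety patterns listed above, here is a minimal sketch. It is not the code from this PR: the class name, method names, date format, and regex are all hypothetical and chosen only to show a shared immutable formatter and a shared `Pattern` with a fresh `Matcher` per call.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of StormCrawler.
final class ThreadSafetySketch {

    // DateTimeFormatter is immutable and thread-safe, so a shared constant is fine.
    // SimpleDateFormat keeps mutable state and must not be shared like this.
    private static final DateTimeFormatter HTTP_DATE =
            DateTimeFormatter.RFC_1123_DATE_TIME.withZone(ZoneOffset.UTC);

    // Pattern is immutable and thread-safe; Matcher is not.
    // Illustrative regex only, not the one used in RefreshTag.
    private static final Pattern REFRESH_URL =
            Pattern.compile("url\\s*=\\s*'?([^';\\s]+)", Pattern.CASE_INSENSITIVE);

    static String formatHttpDate(long epochMillis) {
        return HTTP_DATE.format(Instant.ofEpochMilli(epochMillis));
    }

    static String extractRefreshUrl(String content) {
        // A new Matcher per call, so no state is shared between threads.
        Matcher m = REFRESH_URL.matcher(content);
        return m.find() ? m.group(1) : null;
    }
}
```

Both constants can be read from any number of threads without synchronization; only the per-call `Matcher` carries mutable state, and it never leaves the method.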
     
   ### Resource Leaks  
   - Use try-with-resources for `FileInputStream` in `FileResponse` and 
`InputStream` in `RegexURLFilterBase`  
   - Add `cleanup()` to `DebugParseFilter` to close the output stream  
   - Ensure Playwright tracing is always stopped via a `finally` block in 
`HttpProtocol`, even when an exception is thrown  
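
The resource-handling changes above follow two standard Java idioms, sketched below under stated assumptions. The `Tracing` interface is a hypothetical stand-in (it is not the Playwright API), and `readRules` only loosely mirrors what a rule-file loader might do; neither is taken from the PR.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

// Hypothetical helper class, not part of StormCrawler.
final class ResourceHandlingSketch {

    // try-with-resources closes the reader (and the wrapped stream)
    // on both the normal and the exceptional path.
    static List<String> readRules(InputStream in) throws IOException {
        List<String> rules = new ArrayList<>();
        try (BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                // Skip blank lines and comments (an assumed file convention).
                if (!line.isEmpty() && !line.startsWith("#")) {
                    rules.add(line);
                }
            }
        }
        return rules;
    }

    // Hypothetical stand-in for a tracing session.
    interface Tracing {
        void start();
        void stop();
    }

    // The finally block guarantees stop() runs even when work.call() throws.
    static <T> T withTracing(Tracing tracing, Callable<T> work) throws Exception {
        tracing.start();
        try {
            return work.call();
        } finally {
            tracing.stop();
        }
    }
}
```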
     
   ### Error Handling & Robustness  
   - Wrap `Long.parseLong()` / `Integer.parseInt()` in `FetcherBolt` with 
try-catch to handle invalid metadata values gracefully  
   - Add null check for classpath resource stream in `RegexURLFilterBase`  
   - Fix duplicate `setExpiryDate()` call and add null guard in 
`CookieConverter`  
   - Fix misleading error message in `CloudSearchUtils` ("must be score" → 
"must NOT be score")  
     
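For the numeric-parsing item above, one common way to "handle invalid values gracefully" is to fall back to a default. The sketch below assumes that strategy; the PR text does not say which fallback `FetcherBolt` uses, and the class and method names here are hypothetical.

```java
// Hypothetical helper class, not part of StormCrawler.
final class MetadataParsingSketch {

    // Returns the parsed value, or defaultValue when the metadata
    // entry is missing or not a valid long. Real code would likely
    // also log a warning in the catch block.
    static long parseLongOrDefault(String value, long defaultValue) {
        if (value == null) {
            return defaultValue;
        }
        try {
            return Long.parseLong(value.trim());
        } catch (NumberFormatException e) {
            return defaultValue;
        }
    }
}
```

With this shape, a malformed metadata value degrades to the default instead of propagating a `NumberFormatException` out of the calling code.
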
   ### Logging & Code Cleanup  
   - Replace all `e.printStackTrace()` calls with proper SLF4J `LOG.error()`  
   - Replace `URLEncoder.encode(url, "UTF-8")` (which forces callers to handle a 
checked `UnsupportedEncodingException`) with the `StandardCharsets.UTF_8` overload  
   - Replace manual `MessageDigest` boilerplate with `DigestUtils.sha512Hex()`  
   - Use actual document charset in `JsRenderingDetector` instead of hardcoded 
UTF-8  
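
Two of the cleanup items above concern character encodings and can be sketched as follows. This is an illustration rather than the PR's code: the class and method names are hypothetical, and the fallback to UTF-8 for an unknown charset name is an assumption. The `URLEncoder.encode(String, Charset)` overload requires Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of StormCrawler.
final class EncodingSketch {

    // The Charset overload declares no checked exception, unlike
    // URLEncoder.encode(String, String).
    static String encode(String url) {
        return URLEncoder.encode(url, StandardCharsets.UTF_8);
    }

    // Decodes content using the charset reported for the document,
    // falling back to UTF-8 (an assumed policy) if the name is
    // missing, malformed, or unsupported.
    static String decode(byte[] content, String charsetName) {
        Charset charset = StandardCharsets.UTF_8;
        if (charsetName != null) {
            try {
                charset = Charset.forName(charsetName);
            } catch (IllegalArgumentException e) {
                // Covers IllegalCharsetNameException and
                // UnsupportedCharsetException; keep the UTF-8 default.
            }
        }
        return new String(content, charset);
    }
}
```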
     
   ### No behavior changes intended  
   All changes are backward-compatible. The crawl logic and output remain the 
same.

