prakharchaube opened a new pull request, #918:
URL: https://github.com/apache/nutch/pull/918
## Summary
Both catch blocks in `CrawlDbFilter.map()` caught generic `Exception` and
set `url = null`, so URLs were silently dropped both for legitimate filtering
reasons and for plugin programming errors (NPE, etc.). The latter masked
plugin bugs as ordinary filtering decisions.
## Changes
**Normalizer block**
- `MalformedURLException`: the only legitimate reason to drop the URL. Tracked
via `ErrorTracker` (`ErrorType.URL`); it no longer increments
`urlsFilteredCounter`, which conflated filtering with malformed input.
- `RuntimeException`: logged at ERROR and tracked. The URL is **not** dropped,
so plugin bugs do not silently delete data.
**Filter block**
- `URLFilterException`: per the `URLFilter` contract, reserved for
internal filter failures (rejection is signaled by returning `null`). Logged at
ERROR and tracked. The URL is **not** dropped.
- `RuntimeException`: same handling as in the normalizer block.
All error paths now use `ErrorTracker` for categorized counters and log at
ERROR rather than WARN, per [@lewismc's recommendation in the JIRA
discussion](https://issues.apache.org/jira/browse/NUTCH-3164).
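
For reviewers who want the intended control flow at a glance, here is a minimal, self-contained sketch of the policy described above. It is not the actual patch: `Normalizer`, `Filter`, `ErrorKind`, and the local `URLFilterException` are hypothetical stand-ins for the Nutch types, the `errors` map stands in for `ErrorTracker`, and logging is reduced to comments. It also assumes that a `null` return from the filter is the only path that counts as "filtered".

```java
import java.net.MalformedURLException;
import java.util.EnumMap;
import java.util.Map;

class CrawlDbFilterSketch {

  /** Hypothetical local stand-in for Nutch's URLFilterException. */
  static class URLFilterException extends Exception {
    URLFilterException(String msg) { super(msg); }
  }

  /** Hypothetical stand-ins for the normalizer and filter plugin chains. */
  interface Normalizer { String normalize(String url) throws MalformedURLException; }
  interface Filter { String filter(String url) throws URLFilterException; }

  /** Hypothetical categories; the PR text only names ErrorType.URL. */
  enum ErrorKind { URL, FILTER_FAILURE, PLUGIN_BUG }

  /** Stand-in for ErrorTracker's categorized counters. */
  final Map<ErrorKind, Integer> errors = new EnumMap<>(ErrorKind.class);
  /** Stand-in for urlsFilteredCounter. */
  int urlsFiltered = 0;

  private void track(ErrorKind kind) { errors.merge(kind, 1, Integer::sum); }

  /** Returns the URL to keep, or null if the record should be dropped. */
  String process(String url, Normalizer normalizer, Filter filter) {
    try {
      url = normalizer.normalize(url);
    } catch (MalformedURLException e) {
      // Malformed input: drop, but count it as an error, not as "filtered".
      track(ErrorKind.URL);
      return null;
    } catch (RuntimeException e) {
      // Plugin bug: would be logged at ERROR; keep the un-normalized URL.
      track(ErrorKind.PLUGIN_BUG);
    }

    try {
      String filtered = filter.filter(url);
      if (filtered == null) {
        // Explicit rejection by a filter: the only path counted as filtered.
        urlsFiltered++;
        return null;
      }
      url = filtered;
    } catch (URLFilterException e) {
      // Internal filter failure: would be logged at ERROR; keep the URL.
      track(ErrorKind.FILTER_FAILURE);
    } catch (RuntimeException e) {
      // Plugin bug: same handling as in the normalizer block.
      track(ErrorKind.PLUGIN_BUG);
    }
    return url;
  }

  public static void main(String[] args) {
    CrawlDbFilterSketch sketch = new CrawlDbFilterSketch();
    // A filter that hits a programming error no longer deletes the record.
    String kept = sketch.process("http://example.org/",
        u -> u,
        u -> { throw new NullPointerException("buggy plugin"); });
    System.out.println("kept=" + kept + " errors=" + sketch.errors);
  }
}
```

The point of separating the three outcomes is that a dropped record can only come from malformed input or an explicit `null` from a filter, while every exception path leaves a categorized trace.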
## JIRA
[NUTCH-3164](https://issues.apache.org/jira/browse/NUTCH-3164)
## Test plan
- [x] `ant compile` passes locally
- [ ] No new unit tests in this PR; happy to add one if reviewers want a
mock plugin throwing each exception type (flag as follow-up otherwise)
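
If reviewers do want that follow-up, the mock plugins could look roughly like the sketch below. Everything in it is hypothetical scaffolding rather than Nutch API: `UrlPlugin` is a simplified stand-in for a normalizer or filter, `URLFilterException` is a local copy, and `shouldDrop` merely restates the policy from the Changes section so a test can assert against it.

```java
import java.net.MalformedURLException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

class ThrowingPluginMocks {

  /** Hypothetical local stand-in for Nutch's URLFilterException. */
  static class URLFilterException extends Exception {
    URLFilterException(String msg) { super(msg); }
  }

  /** Simplified stand-in for a normalizer or filter plugin. */
  interface UrlPlugin { String apply(String url) throws Exception; }

  /** Builds a plugin that always throws the supplied exception. */
  static UrlPlugin throwing(Supplier<Exception> exception) {
    return url -> { throw exception.get(); };
  }

  /** One mock per exception type named in this PR, keyed by simple class name. */
  static Map<String, UrlPlugin> allMocks() {
    Map<String, UrlPlugin> mocks = new LinkedHashMap<>();
    mocks.put("MalformedURLException", throwing(() -> new MalformedURLException("mock")));
    mocks.put("URLFilterException", throwing(() -> new URLFilterException("mock")));
    mocks.put("NullPointerException", throwing(() -> new NullPointerException("mock")));
    return mocks;
  }

  /** Policy from this PR: only malformed input justifies dropping the URL. */
  static boolean shouldDrop(Exception e) {
    return e instanceof MalformedURLException;
  }
}
```

A unit test would then feed each mock through `CrawlDbFilter.map()` and assert that the record is dropped only where `shouldDrop` says so, and that the matching error counter was incremented.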
## Out of scope (separate tickets if desired)
- The same `catch (Exception)` pattern exists in `Injector.filterNormalize`
and ~20 other call sites of `normalizers.normalize` / `filters.filter`. Per
@lewismc's comment, those should be changed in separate follow-ups.
- Whether to track normalized URLs as a metric system-wide.