[
https://issues.apache.org/jira/browse/NUTCH-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18084612#comment-18084612
]
Hudson commented on NUTCH-3164:
-------------------------------
SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #238 (See
[https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/238/])
NUTCH-3164 Catch specific exceptions in CrawlDbFilter so plugin errors no
longer silently drop URLs (prakharchaube13:
[https://github.com/apache/nutch/commit/0c4f51d8382542dadf9373c6e524d7f92f901076])
* (edit) src/java/org/apache/nutch/crawl/CrawlDbFilter.java
NUTCH-3164 Added Unit Test case (prakharchaube13:
[https://github.com/apache/nutch/commit/fc26180cc55810a2ce0d2bb3ab90181394200f96])
* (add) src/test/org/apache/nutch/crawl/TestCrawlDbFilterExceptionHandling.java
> Generic exceptions in catch block may lead to deletion of links from crawldb
> ----------------------------------------------------------------------------
>
> Key: NUTCH-3164
> URL: https://issues.apache.org/jira/browse/NUTCH-3164
> Project: Nutch
> Issue Type: Bug
> Components: crawldb
> Affects Versions: 1.22
> Reporter: Prakhar Chaube
> Priority: Critical
> Fix For: 1.23
>
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> In CrawlDbFilter.java (lines ~107-121), both the URL normalization and URL
> filtering blocks catch Exception instead of the specific checked exceptions
> declared by URLNormalizers.normalize() (MalformedURLException) and
> URLFilters.filter() (URLFilterException).
>     try {
>       url = normalizers.normalize(url, scope);
>     } catch (Exception e) {
>       LOG.warn("Skipping {}: ", url, e);
>       url = null;
>     }
>     try {
>       url = filters.filter(url);
>     } catch (Exception e) {
>       LOG.warn("Skipping {}: ", url, e);
>       url = null;
>     }
> *Problem:*
> Any {{RuntimeException}} (e.g., {{NullPointerException}},
> {{IllegalArgumentException}}, {{ArrayIndexOutOfBoundsException}})
> thrown by a buggy normalizer/filter plugin gets caught, logged as a WARN, and
> the URL is silently nulled out and counted as "filtered." This has two
> consequences:
> # *Silent data loss*: legitimate URLs are dropped from CrawlDb not because
> they failed normalization/filtering, but because of an unrelated bug in a
> plugin. The operator sees a WARN log but the URL is gone with no distinction
> between "bad URL" and "broken plugin."
> # *Bug masking*: {{RuntimeException}}s typically indicate programming
> errors. Swallowing them makes it significantly harder to detect and diagnose
> faulty normalizer/filter implementations, especially at scale where WARN logs
> get lost in noise.
>
> Raising as critical since this can lead to data loss.
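
For illustration, a minimal sketch of the narrower catch blocks the description
argues for. This is not the committed patch. The Normalizer and Filter
interfaces, UrlFilterException, and the process() helper are hypothetical
stand-ins so that the example compiles without Nutch on the classpath; only the
exception types mirror the ones named in the description, and System.err stands
in for LOG.warn.

```java
import java.net.MalformedURLException;

/** Self-contained sketch; every type here is a hypothetical stand-in, not Nutch API. */
class CrawlDbFilterSketch {

  /** Stand-in for the checked exception declared by URLFilters.filter(). */
  static class UrlFilterException extends Exception {
    UrlFilterException(String message) { super(message); }
  }

  /** Stand-in for a normalizer plugin chain. */
  interface Normalizer {
    String normalize(String url, String scope) throws MalformedURLException;
  }

  /** Stand-in for a filter plugin chain; returning null means "rejected". */
  interface Filter {
    String filter(String url) throws UrlFilterException;
  }

  /** Toy normalizer: trims the URL and rejects anything without a scheme. */
  static final Normalizer SCHEME_CHECK = (url, scope) -> {
    String trimmed = url.trim();
    if (!trimmed.contains("://")) {
      throw new MalformedURLException("no scheme: " + trimmed);
    }
    return trimmed;
  };

  /** Toy filter that accepts every URL unchanged. */
  static final Filter ACCEPT_ALL = url -> url;

  /** Toy filter simulating a plugin bug (an unchecked exception). */
  static final Filter BUGGY = url -> {
    throw new IllegalStateException("bug inside filter plugin");
  };

  /**
   * Returns the normalized URL, or null if it was legitimately rejected.
   * Only the declared checked exceptions are caught, so a RuntimeException
   * from a broken plugin propagates to the caller instead of dropping the URL.
   */
  static String process(String url, String scope, Normalizer normalizer, Filter filter) {
    String current;
    try {
      current = normalizer.normalize(url, scope);
    } catch (MalformedURLException e) {
      System.err.println("Skipping " + url + ": " + e);
      return null;
    }
    if (current == null) {
      return null; // normalizer rejected the URL; nothing left to filter
    }
    try {
      return filter.filter(current);
    } catch (UrlFilterException e) {
      System.err.println("Skipping " + current + ": " + e);
      return null;
    }
  }

  public static void main(String[] args) {
    // Well-formed URL: survives normalization and filtering.
    System.out.println(process(" http://example.org/ ", "crawldb", SCHEME_CHECK, ACCEPT_ALL));
    // Malformed URL: skipped through the checked-exception path.
    System.out.println(process("not a url", "crawldb", SCHEME_CHECK, ACCEPT_ALL));
    // Plugin bug: surfaces as an exception rather than a silent drop.
    try {
      process("http://example.org/", "crawldb", SCHEME_CHECK, BUGGY);
    } catch (IllegalStateException e) {
      System.out.println("plugin bug surfaced: " + e.getMessage());
    }
  }
}
```

With this shape, a malformed URL is still skipped with a warning, while a
plugin bug fails loudly. Whether such a failure should abort the job or be
counted separately is a design choice the sketch leaves open.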