Lewis John McGibbney created NUTCH-3155:
-------------------------------------------

             Summary: Missing ErrorTracker in CrawlDbFilter, DeduplicationJob, 
WebGraph and inconsistent initialization in FetcherThread
                 Key: NUTCH-3155
                 URL: https://issues.apache.org/jira/browse/NUTCH-3155
             Project: Nutch
          Issue Type: Sub-task
          Components: metrics
    Affects Versions: 1.22
            Reporter: Lewis John McGibbney
            Assignee: Lewis John McGibbney
             Fix For: 1.23


Several MapReduce jobs have gaps in error metrics reporting, discovered during 
smoke testing on a single-node Hadoop cluster; see 
https://ci-builds.apache.org/job/Nutch/job/Nutch-Smoke-Test-Single-Node-Hadoop-Cluster/37/

{*}Issue A{*}: Three jobs lack ErrorTracker integration. CrawlDbFilter, 
DeduplicationJob, and WebGraph all catch exceptions during processing but only 
log them -- they do not use ErrorTracker to emit errors_* Hadoop counters. As a 
result, errors in these jobs are silently lost from metrics, unlike all other 
Nutch MapReduce jobs (Injector, Generator, Fetcher, Parser, CrawlDb, Sitemap, 
HostDb, WARCExporter, Indexer), which do report categorized error counts. A 
sketch of the intended wiring follows the list of affected code paths below.

Affected code paths:
 * CrawlDbFilter.java (lines ~117-119, ~125-127): catches exceptions during URL 
normalization and filtering, logs a warning, but does not record via 
ErrorTracker.

 * DeduplicationJob.java (lines ~227-229, ~233-235): catches 
UnsupportedEncodingException and IllegalArgumentException during URL decoding, 
logs an error, but does not record via ErrorTracker.

 * WebGraph.java (lines ~189-191, ~213-215): catches exceptions during URL 
normalization and filtering in the mapper, logs or silently swallows them, but 
does not record via ErrorTracker.
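
The intended wiring is roughly the following. This is a minimal sketch only, 
modeled on the CrawlDbFilter mapper: the ErrorTracker/NutchMetrics package 
locations, the group constant NutchMetrics.GROUP_CRAWLDB, and the error(...) 
and updateCounters(...) method names are assumptions and may differ from the 
actual ErrorTracker API.

{code:java}
// Minimal sketch, not the actual CrawlDbFilter patch. Assumptions are marked.
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.metrics.ErrorTracker;  // package location assumed
import org.apache.nutch.metrics.NutchMetrics;  // package location assumed

public class CrawlDbFilterSketch extends Mapper<Text, CrawlDatum, Text, CrawlDatum> {

  private ErrorTracker errorTracker;

  @Override
  protected void setup(Context context) {
    // Cached constructor: all errors_* counters are registered with Hadoop upfront.
    errorTracker = new ErrorTracker(NutchMetrics.GROUP_CRAWLDB, context); // group constant assumed
  }

  @Override
  protected void map(Text key, CrawlDatum value, Context context)
      throws IOException, InterruptedException {
    String url = key.toString();
    try {
      url = normalizeAndFilter(url); // stands in for the URLNormalizers/URLFilters calls
    } catch (Exception e) {
      // Previously only logged; now also categorized into an errors_* counter.
      errorTracker.error(e); // method name assumed
      url = null;
    }
    if (url != null) {
      context.write(new Text(url), value);
    }
  }

  @Override
  protected void cleanup(Context context) {
    // Flush any buffered error counts to Hadoop counters at task end.
    errorTracker.updateCounters(context); // method name assumed
  }

  private String normalizeAndFilter(String url) throws Exception {
    return url; // placeholder for the real normalization/filtering logic
  }
}
{code}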

{*}Issue B{*}: FetcherThread uses the non-cached ErrorTracker constructor. 
FetcherThread initializes its ErrorTracker as new 
ErrorTracker(NutchMetrics.GROUP_FETCHER) (without a context reference), while 
all other jobs use new ErrorTracker(group, context). The cached constructor 
(used by other jobs) calls initCounters(context), which registers all 9 error 
counter names (errors_total, errors_network_total, errors_protocol_total, 
errors_parsing_total, errors_url_total, errors_scoring_total, 
errors_indexing_total, errors_timeout_total, errors_other_total) with Hadoop 
upfront. These counters appear in job output even when their values are zero, 
providing a consistent and complete view. The non-cached constructor (used by 
FetcherThread) only registers per-category counters when their count is > 0 
(the two paths are contrasted in the sketch further below). This results in:
 * When there are zero errors: only errors_total=0 appears, with no 
per-category breakdown.

 * When there are errors: only errors_total and the specific non-zero category 
appear (e.g., errors_other_total=1), while all other categories are absent.

This is inconsistent with every other job and makes it harder to monitor error 
distributions at a glance.
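
To make the constructor difference concrete, here is a behavioral sketch (not 
the actual ErrorTracker source); the registration mechanism shown for 
initCounters(context) -- touching each counter so it appears at zero -- is an 
assumption about how the upfront registration is achieved.

{code:java}
// Behavioral sketch of the cached vs. non-cached constructors; not the real class.
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class ErrorTrackerSketch {

  private static final String[] COUNTER_NAMES = {
      "errors_total", "errors_network_total", "errors_protocol_total",
      "errors_parsing_total", "errors_url_total", "errors_scoring_total",
      "errors_indexing_total", "errors_timeout_total", "errors_other_total"
  };

  private final String group;

  // Non-cached constructor (FetcherThread today): nothing is registered here,
  // so a per-category counter only shows up later if its value becomes non-zero.
  public ErrorTrackerSketch(String group) {
    this.group = group;
  }

  // Cached constructor (all other jobs): registers every counter with Hadoop
  // immediately, so all nine names appear in job output even when zero.
  public ErrorTrackerSketch(String group, TaskAttemptContext context) {
    this(group);
    initCounters(context);
  }

  private void initCounters(TaskAttemptContext context) {
    for (String name : COUNTER_NAMES) {
      // Touching the counter registers it at 0 (registration detail assumed).
      context.getCounter(group, name).increment(0);
    }
  }
}
{code}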

{*}Proposed fix{*}:
 # Add ErrorTracker to CrawlDbFilter, DeduplicationJob, and WebGraph, wiring it 
into the existing catch blocks and emitting counters in cleanup.

 # Change FetcherThread line ~301 from new 
ErrorTracker(NutchMetrics.GROUP_FETCHER) to new 
ErrorTracker(NutchMetrics.GROUP_FETCHER, context) to match the pattern used by 
all other jobs, as sketched below.
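
For reference, the one-line change in item 2 would look like the following. The 
field name errorTracker is assumed, and context is taken to be the task context 
reachable from FetcherThread, as the proposed fix implies.

{code:java}
// Before (non-cached; per-category counters registered only when > 0):
errorTracker = new ErrorTracker(NutchMetrics.GROUP_FETCHER);

// After (cached; initCounters(context) registers all errors_* counters upfront):
errorTracker = new ErrorTracker(NutchMetrics.GROUP_FETCHER, context);
{code}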



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
