[
https://issues.apache.org/jira/browse/NUTCH-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Work on NUTCH-3155 started by Lewis John McGibbney.
---------------------------------------------------
> Missing ErrorTracker in CrawlDbFilter, DeduplicationJob, WebGraph and
> inconsistent initialization in FetcherThread
> ------------------------------------------------------------------------------------------------------------------
>
> Key: NUTCH-3155
> URL: https://issues.apache.org/jira/browse/NUTCH-3155
> Project: Nutch
> Issue Type: Sub-task
> Components: metrics
> Affects Versions: 1.22
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Major
> Fix For: 1.23
>
>
> Several MapReduce jobs have gaps in error metrics reporting, discovered
> during smoke testing on a single-node Hadoop cluster; cf.
> https://ci-builds.apache.org/job/Nutch/job/Nutch-Smoke-Test-Single-Node-Hadoop-Cluster/37/
> {*}Issue A{*}: Three jobs lack ErrorTracker integration
> CrawlDbFilter, DeduplicationJob, and WebGraph all catch exceptions during
> processing but only log them -- they do not use ErrorTracker to emit errors_*
> Hadoop counters. This means errors in these jobs are silently lost from
> metrics, unlike all other Nutch MapReduce jobs (Injector, Generator, Fetcher,
> Parser, CrawlDb, Sitemap, HostDb, WARCExporter, Indexer), which do report
> categorized error counts.
> Affected code paths:
> * CrawlDbFilter.java (lines ~117-119, ~125-127): catches exceptions during
> URL normalization and filtering, logs a warning, but does not record via
> ErrorTracker.
> * DeduplicationJob.java (lines ~227-229, ~233-235): catches
> UnsupportedEncodingException and IllegalArgumentException during URL
> decoding, logs an error, but does not record via ErrorTracker.
> * WebGraph.java (lines ~189-191, ~213-215): catches exceptions during URL
> normalization and filtering in the mapper, logs or silently swallows them,
> but does not record via ErrorTracker.
> {*}Issue B{*}: FetcherThread uses non-cached ErrorTracker constructor
> FetcherThread initializes its ErrorTracker as new
> ErrorTracker(NutchMetrics.GROUP_FETCHER) (without a context reference), while
> all other jobs use new ErrorTracker(group, context).
> The cached constructor (used by other jobs) calls initCounters(context),
> which registers all 9 error counter names (errors_total,
> errors_network_total, errors_protocol_total, errors_parsing_total,
> errors_url_total, errors_scoring_total, errors_indexing_total,
> errors_timeout_total, errors_other_total) with Hadoop upfront. These counters
> appear in job output even when their values are zero, providing a consistent
> and complete view.
> The non-cached constructor (used by FetcherThread) only registers
> per-category counters when their count is > 0. This results in:
> * When there are zero errors: only errors_total=0 appears, with no
> per-category breakdown.
> * When there are errors: only errors_total and the specific non-zero
> category appear (e.g., errors_other_total=1), while all other categories are
> absent.
> This is inconsistent with every other job and makes it harder to monitor
> error distributions at a glance.
> {*}Proposed fix{*}:
> # Add ErrorTracker to CrawlDbFilter, DeduplicationJob, and WebGraph, wiring
> it into the existing catch blocks and emitting counters in cleanup.
> # Change FetcherThread line ~301 from new
> ErrorTracker(NutchMetrics.GROUP_FETCHER) to new
> ErrorTracker(NutchMetrics.GROUP_FETCHER, context) to match the pattern used
> by all other jobs.
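
The difference between the two constructors, and the catch-block wiring
proposed above, can be illustrated with a small self-contained sketch. It does
not use the real Nutch or Hadoop classes: FakeContext, SketchTracker,
normalize() and run() are hypothetical stand-ins invented for illustration,
and only the counter names are taken from the issue text. The real
ErrorTracker API may differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ErrorTrackerSketch {

  // Per-category counter names as listed in the issue (errors_total is handled separately).
  static final String[] CATEGORIES = {
      "errors_network_total", "errors_protocol_total", "errors_parsing_total",
      "errors_url_total", "errors_scoring_total", "errors_indexing_total",
      "errors_timeout_total", "errors_other_total"};

  /** Hypothetical stand-in for a Hadoop task context: a map of named counters. */
  static class FakeContext {
    final Map<String, Long> counters = new LinkedHashMap<>();

    void increment(String name, long by) {
      counters.merge(name, by, Long::sum);
    }
  }

  /** Hypothetical tracker; NOT the real org.apache.nutch ErrorTracker. */
  static class SketchTracker {
    private final Map<String, Long> local = new LinkedHashMap<>();

    /** "Non-cached" style: nothing is registered until emit(). */
    SketchTracker() {}

    /** "Cached" style: register every counter upfront with a zero value. */
    SketchTracker(FakeContext ctx) {
      ctx.increment("errors_total", 0);
      for (String category : CATEGORIES) {
        ctx.increment(category, 0);
      }
    }

    void record(String category) {
      local.merge(category, 1L, Long::sum);
    }

    /** Flush locally accumulated counts, as a cleanup() method would. */
    void emit(FakeContext ctx) {
      long total = 0;
      for (Map.Entry<String, Long> e : local.entrySet()) {
        ctx.increment(e.getKey(), e.getValue());
        total += e.getValue();
      }
      ctx.increment("errors_total", total);
    }
  }

  /** Hypothetical normalizer that rejects anything not starting with "http". */
  static String normalize(String url) {
    if (!url.startsWith("http")) {
      throw new IllegalArgumentException("bad url: " + url);
    }
    return url.toLowerCase();
  }

  /** Simulates one task: process URLs, record failures, emit in cleanup. */
  static FakeContext run(boolean eager, String... urls) {
    FakeContext ctx = new FakeContext();
    SketchTracker tracker = eager ? new SketchTracker(ctx) : new SketchTracker();
    for (String url : urls) {
      try {
        normalize(url);
      } catch (IllegalArgumentException e) {
        // Previously only logged; now also counted.
        tracker.record("errors_url_total");
      }
    }
    tracker.emit(ctx);
    return ctx;
  }

  public static void main(String[] args) {
    System.out.println(run(false, "http://a.example/").counters);
    System.out.println(run(true, "http://a.example/", "not-a-url").counters);
  }
}
```

In this sketch the lazy variant with no errors reports only errors_total,
whereas the eager variant always reports all nine counters, which is the
consistency the proposed fix aims for.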
--
This message was sent by Atlassian Jira
(v8.20.10#820010)