[
https://issues.apache.org/jira/browse/NUTCH-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Lewis John McGibbney updated NUTCH-3155:
----------------------------------------
Description:
A codebase-wide sweep of all MapReduce Mapper and Reducer classes identified
six jobs that catch exceptions during processing but do not use ErrorTracker to
emit categorized errors_* Hadoop counters. Errors in these jobs are logged but
invisible in job metrics.
{*}Affected classes{*}:
* LinkDb.LinkDbMapper
* LinkDbFilter
* CrawlDbReader.CrawlDbStatMapper
* SegmentMerger.SegmentMergerMapper
* ReadHostDb.ReadHostDbMapper
* UpdateHostDbReducer
The ticket is scoped to adding ErrorTracker (with the cached constructor, for
consistent counter registration) to each class and wiring it into the existing
catch blocks.
This will also require new counter group constants in NutchMetrics.java.
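A rough sketch of the intended wiring, using LinkDbFilter as the example. This is illustrative only: the ErrorTracker method names track() and commit(), the GROUP_LINKDB constant, and the assumed package locations are placeholders, not the final API, and the existing normalize/filter logic is abbreviated.
{code:java}
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.metrics.ErrorTracker;  // assumed package location
import org.apache.nutch.metrics.NutchMetrics;  // assumed package location

public class LinkDbFilter extends Mapper<Text, Inlinks, Text, Inlinks> {

  private ErrorTracker errorTracker;

  @Override
  public void setup(Context context) {
    // Cached constructor: initCounters(context) registers every errors_* counter
    // upfront, so all categories appear in job output even when their value is zero.
    errorTracker = new ErrorTracker(NutchMetrics.GROUP_LINKDB, context);
  }

  @Override
  public void map(Text key, Inlinks inlinks, Context context)
      throws IOException, InterruptedException {
    String url = key.toString();
    try {
      url = normalizeAndFilter(url);   // stand-in for the existing normalizer/filter calls
    } catch (Exception e) {
      errorTracker.track(e);           // new: categorize and count the failure
      return;                          // keep the existing skip-on-error behavior
    }
    if (url != null) {
      context.write(new Text(url), inlinks);
    }
  }

  @Override
  public void cleanup(Context context) {
    errorTracker.commit(context);      // new: flush accumulated counts to Hadoop
  }

  private String normalizeAndFilter(String url) throws Exception {
    return url;                        // placeholder; the real code calls normalizers/filters
  }
}
{code}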
was:
Several MapReduce jobs have gaps in error metrics reporting, discovered during
smoke testing on a single-node Hadoop cluster, cf.
https://ci-builds.apache.org/job/Nutch/job/Nutch-Smoke-Test-Single-Node-Hadoop-Cluster/37/
{*}Issue A{*}: Three jobs lack ErrorTracker integration. CrawlDbFilter,
DeduplicationJob, and WebGraph all catch exceptions during processing but only
log them -- they do not use ErrorTracker to emit errors_* Hadoop counters. This
means errors in these jobs are silently lost from metrics, unlike all other
Nutch MapReduce jobs (Injector, Generator, Fetcher, Parser, CrawlDb, Sitemap,
HostDb, WARCExporter, Indexer), which do report categorized error counts.
Affected code paths (a sketch of the proposed wiring follows this list):
* CrawlDbFilter.java (lines ~117-119, ~125-127): catches exceptions during URL
normalization and filtering, logs a warning, but does not record via
ErrorTracker.
* DeduplicationJob.java (lines ~227-229, ~233-235): catches
UnsupportedEncodingException and IllegalArgumentException during URL decoding,
logs an error, but does not record via ErrorTracker.
* WebGraph.java (lines ~189-191, ~213-215): catches exceptions during URL
normalization and filtering in the mapper, logs or silently swallows them, but
does not record via ErrorTracker.
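A minimal sketch of the Issue A change, shown for the CrawlDbFilter case. It is illustrative only: the errorTracker field, its track() and commit() methods, and the GROUP_CRAWLDB constant are assumptions, and the real class uses two separate catch blocks rather than the single one shown here.
{code:java}
// In CrawlDbFilter.setup():
//   errorTracker = new ErrorTracker(NutchMetrics.GROUP_CRAWLDB, context);  // assumed constant

// In CrawlDbFilter.map(), around the existing catch blocks (~lines 117-127):
try {
  url = normalizers.normalize(url, scope);            // existing normalization call
  url = filters.filter(url);                          // existing filtering call
} catch (Exception e) {
  LOG.warn("Skipping {}: {}", url, e.getMessage());   // existing logging stays
  errorTracker.track(e);                              // proposed: record the error so it
  url = null;                                         // shows up in the errors_* counters
}

// In CrawlDbFilter.cleanup():
//   errorTracker.commit(context);                    // assumed flush method
{code}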
{*}Issue B{*}: FetcherThread uses the non-cached ErrorTracker constructor.
FetcherThread initializes its ErrorTracker as new
ErrorTracker(NutchMetrics.GROUP_FETCHER) (without a context reference), while
all other jobs use new ErrorTracker(group, context).
The cached constructor (used by the other jobs) calls initCounters(context),
which registers all nine error counter names (errors_total,
errors_network_total, errors_protocol_total, errors_parsing_total,
errors_url_total, errors_scoring_total, errors_indexing_total,
errors_timeout_total, errors_other_total) with Hadoop upfront. These counters
appear in job output even when their values are zero, providing a consistent
and complete view.
The non-cached constructor (used by FetcherThread) only registers per-category
counters when their count is > 0. This results in:
* When there are zero errors: only errors_total=0 appears, with no
per-category breakdown.
* When there are errors: only errors_total and the specific non-zero category
appear (e.g., errors_other_total=1), while all other categories are absent.
This is inconsistent with every other job and makes it harder to monitor error
distributions at a glance.
{*}Proposed fix{*}:
# Add ErrorTracker to CrawlDbFilter, DeduplicationJob, and WebGraph, wiring it
into the existing catch blocks and emitting counters in cleanup.
# Change FetcherThread line ~301 from new
ErrorTracker(NutchMetrics.GROUP_FETCHER) to new
ErrorTracker(NutchMetrics.GROUP_FETCHER, context) to match the pattern used by
all other jobs (a before/after sketch follows).
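The Issue B change is a one-line constructor swap; a before/after sketch (the errors field name is an assumption):
{code:java}
// FetcherThread, around line ~301

// before: non-cached constructor; per-category counters only appear when > 0
errors = new ErrorTracker(NutchMetrics.GROUP_FETCHER);

// after: cached constructor calls initCounters(context), registering all nine
// errors_* counters upfront so they always appear in the job output
errors = new ErrorTracker(NutchMetrics.GROUP_FETCHER, context);
{code}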
> Add ErrorTracker to remaining MapReduce jobs missing error metrics
> ------------------------------------------------------------------
>
> Key: NUTCH-3155
> URL: https://issues.apache.org/jira/browse/NUTCH-3155
> Project: Nutch
> Issue Type: Sub-task
> Components: metrics
> Affects Versions: 1.22
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Major
> Fix For: 1.23
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)