[ 
https://issues.apache.org/jira/browse/NUTCH-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-3155:
----------------------------------------
    Description: 
A codebase-wide sweep of all MapReduce Mapper and Reducer classes identified 
six jobs that catch exceptions during processing but do not use ErrorTracker to 
emit categorized errors_* Hadoop counters. Errors in these jobs are logged but 
invisible in job metrics.

{*}Affected classes{*}:
 * LinkDb.LinkDbMapper
 * LinkDbFilter
 * CrawlDbReader.CrawlDbStatMapper
 * SegmentMerger.SegmentMergerMapper
 * ReadHostDb.ReadHostDbMapper
 * UpdateHostDbReducer

The ticket is scoped to adding ErrorTracker (using the cached constructor, for 
consistent counter registration) to each class and recording errors from within 
the existing catch blocks.

This will also require new counter group constants in NutchMetrics.java.
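
As a rough illustration of the catch-block wiring described above, here is a minimal, self-contained sketch. It does not use the Nutch or Hadoop APIs: SketchTracker, GROUP_LINKDB_SKETCH, hostOf, and the exception-to-category mapping are all hypothetical stand-ins, and only the errors_* counter names come from this issue.

```java
import java.net.URI;
import java.util.Map;
import java.util.TreeMap;

/** Sketch only: shows recording an error inside an existing catch block. */
final class CatchBlockTrackingSketch {

  /** Hypothetical group name; the ticket does not name the new constants. */
  static final String GROUP_LINKDB_SKETCH = "sketch_linkdb";

  /** Minimal stand-in for an error tracker. This is NOT Nutch's ErrorTracker. */
  static final class SketchTracker {
    final String group;
    final Map<String, Long> counters = new TreeMap<>();

    SketchTracker(String group) {
      this.group = group;
      // Register the counters up front so they are reported even at zero.
      counters.put("errors_total", 0L);
      counters.put("errors_url_total", 0L);
      counters.put("errors_other_total", 0L);
    }

    /** Assumed categorization: bad URL input vs. everything else. */
    void record(Throwable t) {
      String category = (t instanceof IllegalArgumentException)
          ? "errors_url_total" : "errors_other_total";
      counters.merge(category, 1L, Long::sum);
      counters.merge("errors_total", 1L, Long::sum);
    }
  }

  /** Mapper-like helper: returns the host, or null if the URL cannot be parsed. */
  static String hostOf(String url, SketchTracker tracker) {
    try {
      return URI.create(url).getHost();
    } catch (IllegalArgumentException e) {
      // Previously such a block would only log; the added line counts the error.
      tracker.record(e);
      return null;
    }
  }

  public static void main(String[] args) {
    SketchTracker tracker = new SketchTracker(GROUP_LINKDB_SKETCH);
    hostOf("http://example.org/a", tracker);
    hostOf("not a url", tracker); // the spaces make this an invalid URI
    System.out.println(tracker.group + " " + tracker.counters);
  }
}
```

In a real job the tracker would presumably be created once per task with the task context and the counters written to Hadoop rather than kept in a map; the sketch only shows where the recording call sits relative to the existing catch block.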

  was:
Several MapReduce jobs have gaps in error metrics reporting, discovered during 
smoke testing on a single-node Hadoop cluster cf. 
https://ci-builds.apache.org/job/Nutch/job/Nutch-Smoke-Test-Single-Node-Hadoop-Cluster/37/

{*}Issue A{*}: Three jobs lack ErrorTracker integration

CrawlDbFilter, DeduplicationJob, and WebGraph all catch exceptions during 
processing but only log them -- they do not use ErrorTracker to emit errors_* 
Hadoop counters. This means errors in these jobs are silently lost from 
metrics, unlike all other Nutch MapReduce jobs (Injector, Generator, Fetcher, 
Parser, CrawlDb, Sitemap, HostDb, WARCExporter, Indexer) which do report 
categorized error counts.

Affected code paths:
 * CrawlDbFilter.java (lines ~117-119, ~125-127): catches exceptions during URL 
normalization and filtering, logs a warning, but does not record via 
ErrorTracker.

 * DeduplicationJob.java (lines ~227-229, ~233-235): catches 
UnsupportedEncodingException and IllegalArgumentException during URL decoding, 
logs an error, but does not record via ErrorTracker.

 * WebGraph.java (lines ~189-191, ~213-215): catches exceptions during URL 
normalization and filtering in the mapper, logs or silently swallows them, but 
does not record via ErrorTracker.

{*}Issue B{*}: FetcherThread uses non-cached ErrorTracker constructor

FetcherThread initializes its ErrorTracker as new 
ErrorTracker(NutchMetrics.GROUP_FETCHER) (without a context reference), while 
all other jobs use new ErrorTracker(group, context).

The cached constructor (used by other jobs) calls initCounters(context), which 
registers all 9 error counter names (errors_total, errors_network_total, 
errors_protocol_total, errors_parsing_total, errors_url_total, 
errors_scoring_total, errors_indexing_total, errors_timeout_total, 
errors_other_total) with Hadoop upfront. These counters appear in job output 
even when their values are zero, providing a consistent and complete view.

The non-cached constructor (used by FetcherThread) only registers per-category 
counters when their count is > 0. This results in:
 * When there are zero errors: only errors_total=0 appears, with no 
per-category breakdown.

 * When there are errors: only errors_total and the specific non-zero category 
appear (e.g., errors_other_total=1), while all other categories are absent.

This is inconsistent with every other job and makes it harder to monitor error 
distributions at a glance.
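
The difference between the two registration styles can be sketched with plain maps standing in for Hadoop counters. This is an illustration under assumptions, not the ErrorTracker implementation: the class and method names (CounterRegistrationSketch, emitEager, emitLazy) are hypothetical, and only the nine counter names are taken from the description above.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch only: plain maps stand in for Hadoop counters. */
final class CounterRegistrationSketch {

  static final String TOTAL = "errors_total";

  // The eight per-category counter names listed in this issue.
  static final List<String> CATEGORY_COUNTERS = List.of(
      "errors_network_total", "errors_protocol_total", "errors_parsing_total",
      "errors_url_total", "errors_scoring_total", "errors_indexing_total",
      "errors_timeout_total", "errors_other_total");

  /** Eager style: every counter is registered at zero before counts are added. */
  static Map<String, Long> emitEager(Map<String, Long> observed) {
    Map<String, Long> out = new LinkedHashMap<>();
    out.put(TOTAL, 0L);
    for (String name : CATEGORY_COUNTERS) {
      out.put(name, 0L);
    }
    observed.forEach((name, n) -> {
      out.merge(name, n, Long::sum);
      out.merge(TOTAL, n, Long::sum);
    });
    return out;
  }

  /** Lazy style: the total always appears, a category only when its count is > 0. */
  static Map<String, Long> emitLazy(Map<String, Long> observed) {
    Map<String, Long> out = new LinkedHashMap<>();
    long total = 0;
    for (Map.Entry<String, Long> e : observed.entrySet()) {
      if (e.getValue() > 0) {
        out.put(e.getKey(), e.getValue());
        total += e.getValue();
      }
    }
    out.put(TOTAL, total);
    return out;
  }

  public static void main(String[] args) {
    Map<String, Long> oneError = Map.of("errors_other_total", 1L);
    System.out.println("eager: " + emitEager(oneError));
    System.out.println("lazy:  " + emitLazy(oneError));
  }
}
```

With a single errors_other_total error, the eager variant reports all nine names while the lazy variant reports only two, which mirrors the inconsistency described for FetcherThread.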

{*}Proposed fix{*}:
 # Add ErrorTracker to CrawlDbFilter, DeduplicationJob, and WebGraph, wiring it 
into the existing catch blocks and emitting counters in cleanup.

 # Change FetcherThread line ~301 from new 
ErrorTracker(NutchMetrics.GROUP_FETCHER) to new 
ErrorTracker(NutchMetrics.GROUP_FETCHER, context) to match the pattern used by 
all other jobs.


> Add ErrorTracker to remaining MapReduce jobs missing error metrics
> ------------------------------------------------------------------
>
>                 Key: NUTCH-3155
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3155
>             Project: Nutch
>          Issue Type: Sub-task
>          Components: metrics
>    Affects Versions: 1.22
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Major
>             Fix For: 1.23
>
>
> A codebase-wide sweep of all MapReduce Mapper and Reducer classes identified 
> six jobs that catch exceptions during processing but do not use ErrorTracker 
> to emit categorized errors_* Hadoop counters. Errors in these jobs are logged 
> but invisible in job metrics.
> {*}Affected classes{*}:
>  * LinkDb.LinkDbMapper
>  * LinkDbFilter
>  * CrawlDbReader.CrawlDbStatMapper
>  * SegmentMerger.SegmentMergerMapper
>  * ReadHostDb.ReadHostDbMapper
>  * UpdateHostDbReducer
> The ticket is scoped to adding ErrorTracker (using the cached constructor, for 
> consistent counter registration) to each class and recording errors from 
> within the existing catch blocks.
> This will also require new counter group constants in NutchMetrics.java.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
