[ 
https://issues.apache.org/jira/browse/NUTCH-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-3155 started by Lewis John McGibbney.
---------------------------------------------------
> Missing ErrorTracker in CrawlDbFilter, DeduplicationJob, WebGraph and 
> inconsistent initialization in FetcherThread
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-3155
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3155
>             Project: Nutch
>          Issue Type: Sub-task
>          Components: metrics
>    Affects Versions: 1.22
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Major
>             Fix For: 1.23
>
>
> Several MapReduce jobs have gaps in error metrics reporting, discovered 
> during smoke testing on a single-node Hadoop cluster cf. 
> https://ci-builds.apache.org/job/Nutch/job/Nutch-Smoke-Test-Single-Node-Hadoop-Cluster/37/
> {*}Issue A{*}: Three jobs lack ErrorTracker integration
>
> CrawlDbFilter, 
> DeduplicationJob, and WebGraph all catch exceptions during processing but 
> only log them -- they do not use ErrorTracker to emit errors_* Hadoop 
> counters. This means errors in these jobs are silently lost from metrics, 
> unlike all other Nutch MapReduce jobs (Injector, Generator, Fetcher, Parser, 
> CrawlDb, Sitemap, HostDb, WARCExporter, Indexer) which do report categorized 
> error counts.
>
> Affected code paths:
>  * CrawlDbFilter.java (lines ~117-119, ~125-127): catches exceptions during 
> URL normalization and filtering, logs a warning, but does not record via 
> ErrorTracker.
>  * DeduplicationJob.java (lines ~227-229, ~233-235): catches 
> UnsupportedEncodingException and IllegalArgumentException during URL 
> decoding, logs an error, but does not record via ErrorTracker.
>  * WebGraph.java (lines ~189-191, ~213-215): catches exceptions during URL 
> normalization and filtering in the mapper, logs or silently swallows them, 
> but does not record via ErrorTracker.
> {*}Issue B{*}: FetcherThread uses non-cached ErrorTracker constructor
>
> FetcherThread initializes its ErrorTracker as new 
> ErrorTracker(NutchMetrics.GROUP_FETCHER) (without a context reference), while 
> all other jobs use new ErrorTracker(group, context).
>
> The cached constructor 
> (used by other jobs) calls initCounters(context), which registers all 9 error 
> counter names (errors_total, errors_network_total, errors_protocol_total, 
> errors_parsing_total, errors_url_total, errors_scoring_total, 
> errors_indexing_total, errors_timeout_total, errors_other_total) with Hadoop 
> upfront. These counters appear in job output even when their values are zero, 
> providing a consistent and complete view.
>
> The non-cached constructor (used by 
> FetcherThread) only registers per-category counters when their count is > 0. 
> This results in:
>  * When there are zero errors: only errors_total=0 appears, with no 
> per-category breakdown.
>  * When there are errors: only errors_total and the specific non-zero 
> category appear (e.g., errors_other_total=1), while all other categories are 
> absent.
> This is inconsistent with every other job and makes it harder to monitor 
> error distributions at a glance.
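The difference between the two registration behaviours described in Issue B can be sketched with plain Java maps standing in for Hadoop counters. This is only an illustration under stated assumptions: CounterRegistrationSketch, emitEager and emitLazy are hypothetical names, and nothing here calls the real ErrorTracker or Hadoop APIs. Only the nine counter names come from the issue text.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Stand-in sketch; does not use the real Hadoop or Nutch classes. */
class CounterRegistrationSketch {

  // The nine counter names listed in the issue description.
  static final List<String> ALL_COUNTERS = List.of(
      "errors_total", "errors_network_total", "errors_protocol_total",
      "errors_parsing_total", "errors_url_total", "errors_scoring_total",
      "errors_indexing_total", "errors_timeout_total", "errors_other_total");

  /** Eager registration: every counter is visible, including zeros. */
  static Map<String, Long> emitEager(Map<String, Long> categoryCounts) {
    Map<String, Long> visible = new LinkedHashMap<>();
    for (String name : ALL_COUNTERS) {
      visible.put(name, 0L); // registered upfront, like the cached constructor
    }
    long total = 0;
    for (Map.Entry<String, Long> e : categoryCounts.entrySet()) {
      visible.merge(e.getKey(), e.getValue(), Long::sum);
      total += e.getValue();
    }
    visible.put("errors_total", total);
    return visible;
  }

  /** Lazy registration: errors_total always, categories only when > 0. */
  static Map<String, Long> emitLazy(Map<String, Long> categoryCounts) {
    Map<String, Long> visible = new LinkedHashMap<>();
    long total = 0;
    for (Map.Entry<String, Long> e : categoryCounts.entrySet()) {
      if (e.getValue() > 0) {
        visible.put(e.getKey(), e.getValue());
        total += e.getValue();
      }
    }
    visible.put("errors_total", total);
    return visible;
  }

  public static void main(String[] args) {
    Map<String, Long> oneError = Map.of("errors_other_total", 1L);
    System.out.println("eager: " + emitEager(oneError).keySet());
    System.out.println("lazy:  " + emitLazy(oneError).keySet());
  }
}
```

With a single errors_other_total error, the eager variant exposes all nine names while the lazy variant exposes only two, which mirrors the inconsistency reported for FetcherThread.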
> {*}Proposed fix{*}:
>  # Add ErrorTracker to CrawlDbFilter, DeduplicationJob, and WebGraph, wiring 
> it into the existing catch blocks and emitting counters in cleanup.
>  # Change FetcherThread line ~301 from new 
> ErrorTracker(NutchMetrics.GROUP_FETCHER) to new 
> ErrorTracker(NutchMetrics.GROUP_FETCHER, context) to match the pattern used 
> by all other jobs.
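The first step of the proposed fix (record the error inside the existing catch block, then emit counters once at the end) might look roughly like the sketch below. It is modelled loosely on the URL-decoding case described for DeduplicationJob, but it is not the Nutch code: SketchErrorTracker, record, emit and decodeAll are hypothetical stand-ins, and mapping these two exception types to errors_url_total is an assumption made for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

/** Stand-in sketch; the names below are hypothetical, not the Nutch API. */
class CatchBlockSketch {

  /** Hypothetical minimal tracker: counts locally, reports once at the end. */
  static class SketchErrorTracker {
    private final Map<String, Long> counts = new LinkedHashMap<>();

    SketchErrorTracker() {
      // Register counters upfront so zero values are still reported.
      counts.put("errors_total", 0L);
      counts.put("errors_url_total", 0L);
      counts.put("errors_other_total", 0L);
    }

    void record(Exception e) {
      // Assumed categorization: URL decoding failures count as URL errors.
      String category = (e instanceof UnsupportedEncodingException
          || e instanceof IllegalArgumentException)
          ? "errors_url_total" : "errors_other_total";
      counts.merge(category, 1L, Long::sum);
      counts.merge("errors_total", 1L, Long::sum);
    }

    /** Plays the role of emitting counters in cleanup. */
    Map<String, Long> emit() {
      return new LinkedHashMap<>(counts);
    }
  }

  static Map<String, Long> decodeAll(String... urls) {
    SketchErrorTracker tracker = new SketchErrorTracker();
    for (String url : urls) {
      try {
        URLDecoder.decode(url, "UTF-8");
      } catch (UnsupportedEncodingException | IllegalArgumentException e) {
        System.err.println("Error decoding: " + url); // existing behaviour: log
        tracker.record(e);                            // added: count the error
      }
    }
    return tracker.emit();
  }

  public static void main(String[] args) {
    System.out.println(
        decodeAll("http://example.org/a%20b", "http://example.org/%zz"));
  }
}
```

The point of the pattern is that the log statement stays where it is and the tracker call is added next to it, so an error that was previously visible only in logs also shows up in the job's counters.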



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
