[jira] [Updated] (NUTCH-3131) Nutch Metrics Refactoring & Enhancements

Lewis John McGibbney (Jira) Mon, 08 Dec 2025 14:29:20 -0800


     [ 
https://issues.apache.org/jira/browse/NUTCH-3131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Lewis John McGibbney updated NUTCH-3131:
----------------------------------------
    Description: 
The recent “Hadoop Metrics Analysis and Improvement Suggestions” report 
(attached PDF) identified 88 counter increment operations across 17 files with 
multiple anti-patterns and missing key observability signals (inconsistent 
naming, repeated counter lookups, missing latency/error context, cardinality 
risks, etc.).

This ticket covers the implementation of all high- and medium-priority 
improvements from the report (Phases 1–3). Phase 4 (external export, dashboard, 
testing) will be handled in separate tickets.

Acceptance criteria for consideration:
 * All Hadoop counter group and counter names are defined in a single source of 
truth (new class NutchMetricConstants or NutchMetrics).
 * No hardcoded counter group/name strings remain in the 17 affected files.
 * All frequently used counters (especially in hot paths – Fetcher, 
FetcherThread, Generator, Parser, Indexer) are cached in instance variables 
during setup(Context) / setup() and reused.
 * Latency metrics (fetch, parse, index) are added with proper timing and 
recorded via Hadoop counters (average + count).
 * Error counters include error type/context where feasible (at least 
class-level granularity).
 * Counter naming is fully standardized (camelCase counters, PascalCase groups).
 * A lightweight MetricsHelper utility class exists and is used across 
components.
 * Thread-safe accumulation (AtomicLong/TDigest) is consolidated via 
ThreadSafeMetrics or equivalent and flushed correctly to Hadoop counters in 
cleanup().
 * Resource utilization metrics (queue sizes, depths) are added for Fetcher.
 * Basic metrics validation is executed at the end of each job (warn on 
impossible conditions).
 * No regression in existing counter values (verified via existing integration 
tests or new sanity job).
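
As an illustration only, several of the criteria above (a single source of 
truth for names, thread-safe accumulation flushed once at cleanup, latency 
recorded as counters, basic validation) could be sketched roughly as follows. 
This is not the implementation proposed in the report. Everything named here 
is an assumption made for the example: the class names reuse the ones floated 
above, while the counter names, the CounterSink interface and isConsistent() 
are hypothetical. CounterSink stands in for Hadoop's 
context.getCounter(group, name).increment(delta) so the sketch runs with the 
JDK alone. Because Hadoop counters are long values summed across tasks, the 
sketch stores a latency total and a count, and leaves the average to be 
derived from the two.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicLong;

public class MetricsSketch {

  /** Hypothetical constants class: one place for group and counter names. */
  static final class NutchMetrics {
    static final String GROUP_FETCHER = "Fetcher";                 // PascalCase group
    static final String FETCH_SUCCESS = "fetchSuccess";            // camelCase counters
    static final String FETCH_LATENCY_TOTAL_MS = "fetchLatencyTotalMs";
    static final String FETCH_LATENCY_COUNT = "fetchLatencyCount";
    private NutchMetrics() {}
  }

  /** Stand-in for the Hadoop counter API so the sketch runs without Hadoop. */
  interface CounterSink {
    void increment(String group, String name, long delta);
  }

  /** Hypothetical accumulator shared by worker threads in one task. */
  static final class ThreadSafeMetrics {
    private final AtomicLong success = new AtomicLong();
    private final AtomicLong latencyTotalMs = new AtomicLong();
    private final AtomicLong latencyCount = new AtomicLong();

    /** Called on the hot path; touches only in-memory atomics. */
    void recordFetch(long elapsedMs) {
      success.incrementAndGet();
      latencyTotalMs.addAndGet(elapsedMs);
      latencyCount.incrementAndGet();
    }

    /** Drains accumulated values into the sink; intended for cleanup(). */
    void flush(CounterSink sink) {
      sink.increment(NutchMetrics.GROUP_FETCHER, NutchMetrics.FETCH_SUCCESS,
          success.getAndSet(0));
      sink.increment(NutchMetrics.GROUP_FETCHER, NutchMetrics.FETCH_LATENCY_TOTAL_MS,
          latencyTotalMs.getAndSet(0));
      sink.increment(NutchMetrics.GROUP_FETCHER, NutchMetrics.FETCH_LATENCY_COUNT,
          latencyCount.getAndSet(0));
    }
  }

  /** Basic validation: false for an impossible combination of values. */
  static boolean isConsistent(long latencyTotalMs, long latencyCount) {
    return latencyTotalMs >= 0 && latencyCount >= 0
        && (latencyCount > 0 || latencyTotalMs == 0);
  }

  public static void main(String[] args) throws InterruptedException {
    ThreadSafeMetrics metrics = new ThreadSafeMetrics();
    Thread[] workers = new Thread[4];
    for (int i = 0; i < workers.length; i++) {
      workers[i] = new Thread(() -> {
        for (int j = 0; j < 1000; j++) {
          long start = System.nanoTime();
          // ... the fetch itself would happen here ...
          metrics.recordFetch((System.nanoTime() - start) / 1_000_000);
        }
      });
      workers[i].start();
    }
    for (Thread t : workers) {
      t.join();
    }

    // One flush at the end, instead of a counter lookup per record.
    Map<String, Long> counters = new TreeMap<>();
    metrics.flush((group, name, delta) ->
        counters.merge(group + "." + name, delta, Long::sum));
    System.out.println(counters);
  }
}
```

Keeping the hot path on in-memory atomics and writing to the counters once in 
flush() is one way to avoid the repeated counter lookups mentioned above. A 
TDigest-based accumulator for percentiles would need an extra library and is 
left out of this sketch.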

  was:
The recent “Hadoop Metrics Analysis and Improvement Suggestions” report 
(attached to NUTCH-2909) identified 88 counter increment operations across 17 
files with multiple anti-patterns and missing key observability signals 
(inconsistent naming, repeated counter lookups, missing latency/error context, 
cardinality risks, etc.).

This ticket covers the implementation of all high and medium-priority 
improvements from the report (Phases 1–3). Phase 4 (external export, dashboard, 
testing) will be handled in separate tickets.

Acceptance criteria for consideration
 * All Hadoop counter group and counter names are defined in a single source of 
truth (new class NutchMetricConstants or NutchMetrics).
 * No hardcoded counter group/name strings remain in the 17 affected files.
 * All frequently used counters (especially in hot paths – Fetcher, 
FetcherThread, Generator, Parser, Indexer) are cached in instance variables 
during setup(Context) / setup() and reused.
 * Latency metrics (fetch, parse, index) are added with proper timing and 
recorded via Hadoop counters (average + count).
 * Error counters include error type/context where feasible (at least 
class-level granularity).
 * Counter naming is fully standardized (camelCase counters, PascalCase groups).
 * A lightweight MetricsHelper utility class exists and is used across 
components.
 * Thread-safe accumulation (AtomicLong/TDigest) is consolidated via 
ThreadSafeMetrics or equivalent and flushed correctly to Hadoop counters in 
cleanup().
 * Resource utilization metrics (queue sizes, depths) are added for Fetcher.
 * Basic metrics validation is executed at the end of each job (warn on 
impossible conditions).
 * No regression in existing counter values (verified via existing integration 
tests or new sanity job).


> Nutch Metrics Refactoring & Enhancements
> ----------------------------------------
>
>                 Key: NUTCH-3131
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3131
>             Project: Nutch
>          Issue Type: Improvement
>          Components: metrics
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
