[
https://issues.apache.org/jira/browse/NUTCH-3131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Lewis John McGibbney updated NUTCH-3131:
----------------------------------------
Description:
The recent “Hadoop Metrics Analysis and Improvement Suggestions” report
(attached PDF) identified 88 counter increment operations across 17 files with
multiple anti-patterns and missing key observability signals (inconsistent
naming, repeated counter lookups, missing latency/error context, cardinality
risks, etc.).
This ticket covers the implementation of all high- and medium-priority
improvements from the report (Phases 1–3). Phase 4 (external export, dashboard,
testing) will be handled in separate tickets.
Acceptance criteria for consideration:
* All Hadoop counter group and counter names are defined in a single source of
truth (new class NutchMetricConstants or NutchMetrics).
* No hardcoded counter group/name strings remain in the 17 affected files.
* All frequently used counters (especially in hot paths – Fetcher,
FetcherThread, Generator, Parser, Indexer) are cached in instance variables
during setup(Context) / setup() and reused.
* Latency metrics (fetch, parse, index) are added with proper timing and
recorded via Hadoop counters (average + count).
* Error counters include error type/context where feasible (at least
class-level granularity).
* Counter naming is fully standardized (camelCase counters, PascalCase groups).
* A lightweight MetricsHelper utility class exists and is used across
components.
* Thread-safe accumulation (AtomicLong/TDigest) is consolidated via
ThreadSafeMetrics or equivalent and flushed correctly to Hadoop counters in
cleanup().
* Resource utilization metrics (queue sizes, depths) are added for Fetcher.
* Basic metrics validation is executed at the end of each job (warn on
impossible conditions).
* No regression in existing counter values (verified via existing integration
tests or new sanity job).
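
A few of the criteria above (single source of truth for names, class-level
error counters, thread-safe accumulation flushed once at cleanup, latency
recorded as a sum plus a count) can be illustrated with a minimal sketch. It
is not the proposed implementation: every class, constant, and counter name in
it is hypothetical, and CounterSink is a stand-in for Hadoop's
context.getCounter(group, name).increment(n) so that the example runs without
Hadoop on the classpath. Because Hadoop counters are long values summed across
tasks, the sketch assumes "average + count" means storing a total and a count
and deriving the average when the counters are read.

```java
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicLong;

// Sketch only: every class and constant below is hypothetical, not existing Nutch code.
final class MetricsSketch {

  /** Single source of truth for group and counter names (PascalCase groups, camelCase counters). */
  static final class NutchMetrics {
    static final String GROUP_FETCHER = "Fetcher";
    static final String FETCH_COUNT = "fetchCount";
    static final String FETCH_LATENCY_TOTAL_MS = "fetchLatencyTotalMs";
    static final String FETCH_ERROR_PREFIX = "fetchError.";

    private NutchMetrics() {}

    /** Class-level error granularity: one counter per exception class, so cardinality stays bounded. */
    static String errorCounter(Throwable t) {
      return FETCH_ERROR_PREFIX + t.getClass().getSimpleName();
    }
  }

  /** Stand-in for Hadoop's context.getCounter(group, name).increment(amount). */
  interface CounterSink {
    void increment(String group, String name, long amount);
  }

  /** In-memory sink used here only to make the sketch runnable and checkable. */
  static final class InMemorySink implements CounterSink {
    private final Map<String, Long> values = new TreeMap<>();

    @Override
    public synchronized void increment(String group, String name, long amount) {
      values.merge(group + "." + name, amount, Long::sum);
    }

    synchronized long get(String group, String name) {
      return values.getOrDefault(group + "." + name, 0L);
    }
  }

  /** Accumulates latency from many threads; flushed once, e.g. from cleanup(). */
  static final class LatencyAccumulator {
    private final AtomicLong count = new AtomicLong();
    private final AtomicLong totalMs = new AtomicLong();

    void record(long elapsedMs) {
      count.incrementAndGet();
      totalMs.addAndGet(elapsedMs);
    }

    /** Call after worker threads have finished; the two values are not read atomically together. */
    void flush(CounterSink sink) {
      sink.increment(NutchMetrics.GROUP_FETCHER, NutchMetrics.FETCH_COUNT, count.getAndSet(0));
      sink.increment(NutchMetrics.GROUP_FETCHER, NutchMetrics.FETCH_LATENCY_TOTAL_MS, totalMs.getAndSet(0));
    }
  }

  public static void main(String[] args) {
    InMemorySink sink = new InMemorySink();
    LatencyAccumulator fetchLatency = new LatencyAccumulator();

    fetchLatency.record(10);
    fetchLatency.record(30);
    sink.increment(NutchMetrics.GROUP_FETCHER,
        NutchMetrics.errorCounter(new IOException("timeout")), 1);
    fetchLatency.flush(sink);

    long count = sink.get(NutchMetrics.GROUP_FETCHER, NutchMetrics.FETCH_COUNT);
    long total = sink.get(NutchMetrics.GROUP_FETCHER, NutchMetrics.FETCH_LATENCY_TOTAL_MS);
    // The average is derived at read time from the two summed counters.
    System.out.println("average fetch latency ms = " + (count == 0 ? 0 : total / count));
  }
}
```

In a real mapper or reducer the sink would presumably wrap Counter objects
looked up once in setup(Context) and cached in instance fields, which is the
caching criterion listed above.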
was:
The recent “Hadoop Metrics Analysis and Improvement Suggestions” report
(attached to NUTCH-2909) identified 88 counter increment operations across 17
files with multiple anti-patterns and missing key observability signals
(inconsistent naming, repeated counter lookups, missing latency/error context,
cardinality risks, etc.).
This ticket covers the implementation of all high and medium-priority
improvements from the report (Phases 1–3). Phase 4 (external export, dashboard,
testing) will be handled in separate tickets.
Acceptance criteria for consideration
* All Hadoop counter group and counter names are defined in a single source of
truth (new class NutchMetricConstants or NutchMetrics).
* No hardcoded counter group/name strings remain in the 17 affected files.
* All frequently used counters (especially in hot paths – Fetcher,
FetcherThread, Generator, Parser, Indexer) are cached in instance variables
during setup(Context) / setup() and reused.
* Latency metrics (fetch, parse, index) are added with proper timing and
recorded via Hadoop counters (average + count).
* Error counters include error type/context where feasible (at least
class-level granularity).
* Counter naming is fully standardized (camelCase counters, PascalCase groups).
* A lightweight MetricsHelper utility class exists and is used across
components.
* Thread-safe accumulation (AtomicLong/TDigest) is consolidated via
ThreadSafeMetrics or equivalent and flushed correctly to Hadoop counters in
cleanup().
* Resource utilization metrics (queue sizes, depths) are added for Fetcher.
* Basic metrics validation is executed at the end of each job (warn on
impossible conditions).
* No regression in existing counter values (verified via existing integration
tests or new sanity job).
> Nutch Metrics Refactoring & Enhancements
> ----------------------------------------
>
> Key: NUTCH-3131
> URL: https://issues.apache.org/jira/browse/NUTCH-3131
> Project: Nutch
> Issue Type: Improvement
> Components: metrics
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Major