[
https://issues.apache.org/jira/browse/NUTCH-3142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18050724#comment-18050724
]
ASF GitHub Bot commented on NUTCH-3142:
---------------------------------------
lewismc opened a new pull request, #882:
URL: https://github.com/apache/nutch/pull/882
See [NUTCH-3142](https://issues.apache.org/jira/browse/NUTCH-3142) for
background.
This PR implements **Missing Error Context** (recommendation #8) from the
Nutch Hadoop Metrics Analysis report. It introduces a centralized
`ErrorTracker` utility that categorizes errors by type and emits structured
Hadoop counters, replacing the previous approach of counting errors without
categorization.
## Changes
### New Files
- **`src/java/org/apache/nutch/metrics/ErrorTracker.java`** - Thread-safe
  error categorization utility that:
  - Defines 8 error categories: `NETWORK`, `PROTOCOL`, `PARSING`, `URL`,
    `SCORING`, `INDEXING`, `TIMEOUT`, `OTHER`
  - Automatically categorizes exceptions based on type and class name
  - Supports cached counters for performance in hot paths
  - Provides both local accumulation (`recordError`/`emitCounters`) and
    direct increment (`incrementCounters`) APIs
- **`src/test/org/apache/nutch/metrics/TestErrorTracker.java`** -
  Comprehensive test suite with 26 tests covering:
  - Exception categorization for all error types
  - Nutch-specific exceptions (ProtocolException, ParseException,
    ScoringFilterException, etc.)
  - Cached counter initialization and usage
  - Thread safety
  - Nested cause chain handling
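To illustrate the idea of categorizing by exception type and class name, including nested cause chains, here is a minimal sketch. It is not the code in `ErrorTracker.java`: the class name `ErrorCategorizerSketch`, the method `categorize`, the specific JDK exceptions mapped to each category, and the class-name substrings are all assumptions chosen for illustration.

```java
import java.net.ConnectException;
import java.net.MalformedURLException;
import java.net.SocketTimeoutException;
import java.net.URISyntaxException;
import java.net.UnknownHostException;

// Hypothetical sketch only; not the actual org.apache.nutch.metrics.ErrorTracker.
final class ErrorCategorizerSketch {

  // The eight categories named in the PR description.
  enum Category { NETWORK, PROTOCOL, PARSING, URL, SCORING, INDEXING, TIMEOUT, OTHER }

  // Assumed guard against pathological or cyclic cause chains.
  private static final int MAX_CAUSE_DEPTH = 16;

  /** Walks the cause chain and returns the first category that is not OTHER. */
  static Category categorize(Throwable t) {
    int depth = 0;
    for (Throwable cur = t; cur != null && depth < MAX_CAUSE_DEPTH;
        cur = cur.getCause(), depth++) {
      Category c = categorizeSingle(cur);
      if (c != Category.OTHER) {
        return c;
      }
    }
    return Category.OTHER;
  }

  /** Categorizes one throwable: first by JDK type, then by simple class name. */
  private static Category categorizeSingle(Throwable t) {
    if (t instanceof SocketTimeoutException) {
      return Category.TIMEOUT;
    }
    if (t instanceof UnknownHostException || t instanceof ConnectException) {
      return Category.NETWORK;
    }
    if (t instanceof MalformedURLException || t instanceof URISyntaxException) {
      return Category.URL;
    }
    // Name-based fallback, so exceptions such as ProtocolException,
    // ParseException or ScoringFilterException match without a compile-time
    // dependency on them. The substrings are assumptions.
    String name = t.getClass().getSimpleName();
    if (name.contains("Timeout")) {
      return Category.TIMEOUT;
    }
    if (name.contains("Protocol")) {
      return Category.PROTOCOL;
    }
    if (name.contains("Parse")) {
      return Category.PARSING;
    }
    if (name.contains("Scoring")) {
      return Category.SCORING;
    }
    if (name.contains("Indexing")) {
      return Category.INDEXING;
    }
    return Category.OTHER;
  }
}
```

With this sketch, a `RuntimeException` wrapping an `UnknownHostException` falls into `NETWORK`, because the outer exception matches nothing and the walk continues to its cause.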
### Modified Files
#### Metrics Constants (`NutchMetrics.java`)
- Added standard error counter constants: `ERROR_TOTAL`,
`ERROR_NETWORK_TOTAL`, `ERROR_PROTOCOL_TOTAL`, `ERROR_PARSING_TOTAL`,
`ERROR_URL_TOTAL`, `ERROR_SCORING_TOTAL`, `ERROR_INDEXING_TOTAL`,
`ERROR_TIMEOUT_TOTAL`, `ERROR_OTHER_TOTAL`
- Removed redundant component-specific error counters (initially introduced
in #871), which are now handled by `ErrorTracker`
#### Component Integrations
| Component | File | Changes |
|-----------|------|---------|
| Fetcher | `FetcherThread.java`, `Fetcher.java` | Integrated `ErrorTracker` for fetch error categorization |
| Parser | `ParseSegment.java` | Added error tracking for parsing and scoring exceptions |
| Indexer | `IndexerMapReduce.java` | Replaced `errorsScoringFilterCounter` and `errorsIndexingFilterCounter` with `ErrorTracker` |
| Generator | `Generator.java` | Replaced URL filter and malformed URL counters with `ErrorTracker` |
| Injector | `Injector.java` | Added error tracking for URL processing exceptions |
| CrawlDb | `CrawlDbReducer.java` | Added error tracking for scoring filter exceptions |
| HostDb | `UpdateHostDbMapper.java`, `ResolverThread.java` | Replaced `malformedUrlCounter` with `ErrorTracker`; added DNS resolution error tracking |
| Sitemap | `SitemapProcessor.java` | Added error tracking for sitemap processing exceptions |
| WARC | `WARCExporter.java` | Replaced `exceptionCounter` and `invalidUriCounter` with `ErrorTracker` |
#### Dependencies (`ivy/ivy.xml`)
- Added `mockito-core` and `mockito-junit-jupiter` (v5.18.0) as test
dependencies. I had considered doing this in some previous PRs but didn't
want to introduce new dependencies to the project. In this case, it made for
much cleaner, more intuitive tests.
## Benefits
1. **Better Debugging**: Errors are now categorized by type, making it
easier to identify patterns
2. **Reduced Counter Cardinality**: Uses a fixed set of error categories
(~10 counters) instead of an unbounded number of component-specific counters
3. **Consistent API**: All components use the same error tracking mechanism
4. **Performance**: Cached counters avoid repeated lookups in hot paths,
consistent with #878
5. **Thread Safety**: `ConcurrentHashMap` ensures safe concurrent access
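
The local-accumulation pattern behind points 4 and 5 can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: the class `ErrorAccumulatorSketch`, the method signatures, the `BiConsumer` sink standing in for Hadoop counters, and the counter-name strings are all hypothetical. Only the method names `recordError` and `emitCounters` and the use of `ConcurrentHashMap` come from the description above.

```java
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.BiConsumer;

// Hypothetical sketch only; signatures and counter names are assumptions.
final class ErrorAccumulatorSketch {

  // One adder per category; safe to update from many threads at once.
  private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();

  /** Records one error locally without touching the counter backend. */
  void recordError(String category) {
    counts.computeIfAbsent(category, k -> new LongAdder()).increment();
  }

  /**
   * Flushes accumulated counts to the sink (a stand-in for Hadoop counters)
   * and resets them, so a later flush only reports new errors.
   */
  void emitCounters(BiConsumer<String, Long> sink) {
    long total = 0;
    for (Map.Entry<String, LongAdder> e : counts.entrySet()) {
      long n = e.getValue().sumThenReset();
      if (n > 0) {
        // Assumed naming scheme: errors_<category>_total.
        sink.accept("errors_" + e.getKey().toLowerCase(Locale.ROOT) + "_total", n);
        total += n;
      }
    }
    if (total > 0) {
      sink.accept("errors_total", total);
    }
  }
}
```

Accumulating locally and flushing once, for example at task cleanup, keeps the hot path down to a map lookup and an atomic increment. In a real job the sink would presumably forward to the task context's counters.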
I've incorporated these new counters locally into the [nutch-grafana-resources
collector configuration and
dashboards](https://github.com/lewismc/nutch-grafana-resources) and will push
those updates separately. This patch is best tested by inspecting the
Hadoop counters in STDOUT or the logs.
> Add Error Context to Metrics
> ----------------------------
>
> Key: NUTCH-3142
> URL: https://issues.apache.org/jira/browse/NUTCH-3142
> Project: Nutch
> Issue Type: Sub-task
> Components: metrics
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Major
> Fix For: 1.22
>
>
> Current error metrics lack granularity. While protocol status codes are
> tracked dynamically, there's no categorization of:
> * HTTP error codes (4xx vs 5xx)
> * Exception types (timeout, connection refused, DNS failure)
> * Parse failure reasons
> This makes it difficult to diagnose crawl issues from metrics alone; doing
> so requires digging through logs, which complicates troubleshooting.
> This ticket will add new error context metrics for FetcherThread,
> ParseSegment and IndexerMapReduce.