[ 
https://issues.apache.org/jira/browse/NUTCH-3142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18050724#comment-18050724
 ] 

ASF GitHub Bot commented on NUTCH-3142:
---------------------------------------

lewismc opened a new pull request, #882:
URL: https://github.com/apache/nutch/pull/882

   See [NUTCH-3142](https://issues.apache.org/jira/browse/NUTCH-3142) for 
background.
   
   This PR implements **Missing Error Context** (recommendation #8) from the 
Nutch Hadoop Metrics Analysis report. It introduces a centralized 
`ErrorTracker` utility that categorizes errors by type and emits structured 
Hadoop counters, replacing the previous approach of counting errors without 
categorization.
   
   ## Changes
   
   ### New Files
   
   - **`src/java/org/apache/nutch/metrics/ErrorTracker.java`** - Thread-safe 
error categorization utility that:
     - Defines 8 error categories: `NETWORK`, `PROTOCOL`, `PARSING`, `URL`, 
`SCORING`, `INDEXING`, `TIMEOUT`, `OTHER`
     - Automatically categorizes exceptions based on type and class name
     - Supports cached counters for performance in hot paths
     - Provides both local accumulation (`recordError`/`emitCounters`) and 
direct increment (`incrementCounters`) APIs
   
   - **`src/test/org/apache/nutch/metrics/TestErrorTracker.java`** - 
Comprehensive test suite with 26 tests covering:
     - Exception categorization for all error types
     - Nutch-specific exceptions (ProtocolException, ParseException, 
ScoringFilterException, etc.)
     - Cached counter initialization and usage
     - Thread safety
     - Nested cause chain handling
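
To illustrate the categorization and local-accumulation ideas described above, here is a minimal sketch. It is not the code from this PR: the class name `ErrorTrackerSketch`, the `count`/`total` accessors, the cause-chain depth limit, and the specific type and name checks are all assumptions chosen for illustration. Only the eight category names and the `recordError` method name come from the description.

```java
import java.net.ConnectException;
import java.net.MalformedURLException;
import java.net.SocketTimeoutException;
import java.net.URISyntaxException;
import java.net.UnknownHostException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical stand-in for the PR's ErrorTracker; not the actual Nutch code. */
final class ErrorTrackerSketch {

  /** The eight categories listed in the PR description. */
  enum Category { NETWORK, PROTOCOL, PARSING, URL, SCORING, INDEXING, TIMEOUT, OTHER }

  /** Assumed bound so a cyclic cause chain cannot loop forever. */
  private static final int MAX_DEPTH = 16;

  /** Local accumulation: per-category tallies that tolerate concurrent writers. */
  private final Map<Category, LongAdder> counts = new ConcurrentHashMap<>();

  /** Walks the cause chain and returns the first category other than OTHER. */
  static Category categorize(Throwable t) {
    int depth = 0;
    for (Throwable cur = t; cur != null && depth < MAX_DEPTH; cur = cur.getCause(), depth++) {
      Category c = categorizeOne(cur);
      if (c != Category.OTHER) {
        return c;
      }
    }
    return Category.OTHER;
  }

  private static Category categorizeOne(Throwable t) {
    // Type-based checks on JDK exceptions.
    if (t instanceof SocketTimeoutException || t instanceof TimeoutException) {
      return Category.TIMEOUT;
    }
    if (t instanceof MalformedURLException || t instanceof URISyntaxException) {
      return Category.URL;
    }
    if (t instanceof UnknownHostException || t instanceof ConnectException) {
      return Category.NETWORK;
    }
    // Name-based fallback (assumed heuristic) for exceptions not on this sketch's classpath.
    String name = t.getClass().getSimpleName();
    if (name.contains("Protocol")) return Category.PROTOCOL;
    if (name.contains("Parse")) return Category.PARSING;
    if (name.contains("Scoring")) return Category.SCORING;
    if (name.contains("Indexing")) return Category.INDEXING;
    return Category.OTHER;
  }

  /** Categorizes the error and bumps the matching local tally. */
  void recordError(Throwable t) {
    counts.computeIfAbsent(categorize(t), k -> new LongAdder()).increment();
  }

  long count(Category c) {
    LongAdder adder = counts.get(c);
    return adder == null ? 0L : adder.sum();
  }

  long total() {
    return counts.values().stream().mapToLong(LongAdder::sum).sum();
  }
}
```

In this sketch a wrapped exception such as `new RuntimeException(new SocketTimeoutException("read timed out"))` lands in `TIMEOUT`, because the outer `RuntimeException` matches no rule and the walk continues to its cause.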
   
   ### Modified Files
   
   #### Metrics Constants (`NutchMetrics.java`)
   - Added standard error counter constants: `ERROR_TOTAL`, 
`ERROR_NETWORK_TOTAL`, `ERROR_PROTOCOL_TOTAL`, `ERROR_PARSING_TOTAL`, 
`ERROR_URL_TOTAL`, `ERROR_SCORING_TOTAL`, `ERROR_INDEXING_TOTAL`, 
`ERROR_TIMEOUT_TOTAL`, `ERROR_OTHER_TOTAL`
   - Removed redundant component-specific error counters (which I introduced 
initially in #871) now handled by `ErrorTracker`
   
   #### Component Integrations
   | Component | File | Changes |
   |-----------|------|---------|
   | Fetcher | `FetcherThread.java`, `Fetcher.java` | Integrated `ErrorTracker` 
for fetch error categorization |
   | Parser | `ParseSegment.java` | Added error tracking for parsing and 
scoring exceptions |
   | Indexer | `IndexerMapReduce.java` | Replaced `errorsScoringFilterCounter` 
and `errorsIndexingFilterCounter` with `ErrorTracker` |
   | Generator | `Generator.java` | Replaced URL filter and malformed URL 
counters with `ErrorTracker` |
   | Injector | `Injector.java` | Added error tracking for URL processing 
exceptions |
   | CrawlDb | `CrawlDbReducer.java` | Added error tracking for scoring filter 
exceptions |
   | HostDb | `UpdateHostDbMapper.java`, `ResolverThread.java` | Replaced 
`malformedUrlCounter` with `ErrorTracker`; added DNS resolution error tracking |
   | Sitemap | `SitemapProcessor.java` | Added error tracking for sitemap 
processing exceptions |
   | WARC | `WARCExporter.java` | Replaced `exceptionCounter` and 
`invalidUriCounter` with `ErrorTracker` |
   
   #### Dependencies (`ivy/ivy.xml`)
   - Added `mockito-core` and `mockito-junit-jupiter` (v5.18.0) as test 
dependencies. I had considered doing this in some previous PRs but 
didn't want to introduce new dependencies to the project. In this case, they made 
for much cleaner, more intuitive tests.
   
   ## Benefits
   
   1. **Better Debugging**: Errors are now categorized by type, making it 
easier to identify patterns
   2. **Reduced Counter Cardinality**: Uses a fixed set of error categories 
(~10 counters) instead of unlimited component-specific counters
   3. **Consistent API**: All components use the same error tracking mechanism
   4. **Performance**: Cached counters avoid repeated lookups in hot paths, 
consistent with #878
   5. **Thread Safety**: `ConcurrentHashMap` ensures safe concurrent access
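
The cached-counter benefit can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: `CachedCountersSketch`, the `lookup` function (a stand-in for a name-based counter lookup on the job context), and the counter name strings are all hypothetical. The idea shown is resolving each counter handle once at setup so the hot path performs no name-based lookups.

```java
import java.util.EnumMap;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Function;

/** Hypothetical illustration of caching counter handles; not the actual Nutch code. */
final class CachedCountersSketch {

  enum Category { NETWORK, PROTOCOL, PARSING, URL, SCORING, INDEXING, TIMEOUT, OTHER }

  // Populated once in the constructor and never mutated afterwards,
  // so concurrent reads from worker threads are safe.
  private final Map<Category, AtomicLong> cached = new EnumMap<>(Category.class);
  private final AtomicLong total;

  /**
   * Resolves every counter exactly once. The lookup function and the
   * counter names built here are assumptions made for this sketch.
   */
  CachedCountersSketch(Function<String, AtomicLong> lookup) {
    total = lookup.apply("errors_total");
    for (Category c : Category.values()) {
      String name = "errors_" + c.name().toLowerCase(Locale.ROOT) + "_total";
      cached.put(c, lookup.apply(name));
    }
  }

  /** Hot path: two atomic increments, no name-based lookup. */
  void increment(Category c) {
    cached.get(c).incrementAndGet();
    total.incrementAndGet();
  }
}
```

With eight categories plus the total, the constructor performs nine lookups regardless of how many errors are recorded later.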
   
   I've incorporated these new counters locally into the [nutch-grafana-resources 
collector configuration and 
dashboards](https://github.com/lewismc/nutch-grafana-resources) and will push 
those updates separately. This patch is best tested by inspecting the 
Hadoop counters in STDOUT/logging. 




> Add Error Context to Metrics
> ----------------------------
>
>                 Key: NUTCH-3142
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3142
>             Project: Nutch
>          Issue Type: Sub-task
>          Components: metrics
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Major
>             Fix For: 1.22
>
>
> Current error metrics lack granularity. While protocol status codes are 
> tracked dynamically, there's no categorization of:
>  * HTTP error codes (4xx vs 5xx)
>  * Exception types (timeout, connection refused, DNS failure)
>  * Parse failure reasons
> This makes it difficult to diagnose crawl issues from metrics alone, 
> necessitating the interrogation of logs and adding complexity to 
> troubleshooting. 
> This ticket will add new error context metrics for FetcherThread, 
> ParseSegment and IndexerMapReduce. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
