[PR] NUTCH-3162 Latency metrics to properly merge data from all threads and tasks [nutch]

via GitHub Wed, 11 Mar 2026 06:25:24 -0700


lewismc opened a new pull request, #906:
URL: https://github.com/apache/nutch/pull/906


   PR for [NUTCH-3162](https://issues.apache.org/jira/browse/NUTCH-3162) which 
addresses shortcomings in job-level latency percentiles (p50, p95, p99) for 
Fetcher, ParseSegment, and Indexer by merging TDigest data from all map tasks 
and threads and writing counters in a single reducer (or a dedicated merge job 
for Indexer). It should fix the cases where per-task counters were summed and 
percentiles were not merged.
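
To illustrate why summing per-task percentile counters goes wrong, here is a minimal sketch. It is not code from this PR: the class and method names are hypothetical, and exact sample lists stand in for TDigests, which approximate the same pooled distribution.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration, not Nutch code: exact samples stand in for a TDigest.
class LatencyMergeSketch {

    // Nearest-rank percentile over a list of latency samples, with q in (0, 1].
    static long percentile(List<Long> samples, double q) {
        List<Long> sorted = new ArrayList<>(samples);
        Collections.sort(sorted);
        int rank = (int) Math.ceil(q * sorted.size());
        return sorted.get(Math.max(rank, 1) - 1);
    }

    // Merging pools the underlying samples from every task before
    // any percentile is computed.
    static List<Long> merge(List<List<Long>> perTask) {
        List<Long> all = new ArrayList<>();
        for (List<Long> task : perTask) {
            all.addAll(task);
        }
        return all;
    }

    public static void main(String[] args) {
        // One task with many fast fetches, one task with a few slow ones (ms).
        List<Long> fastTask = List.of(10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 20L);
        List<Long> slowTask = List.of(900L, 1000L);

        // What a summed per-task counter would report.
        long summed = percentile(fastTask, 0.5) + percentile(slowTask, 0.5);
        // What a job-level percentile over merged data reports.
        long merged = percentile(merge(List.of(fastTask, slowTask)), 0.5);

        System.out.println("sum of per-task p50 = " + summed);
        System.out.println("p50 of merged data  = " + merged);
    }
}
```

With these made-up samples the summed value is dominated by the small slow task, while the merged p50 reflects that most requests were fast.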
   
   This patch touches the following jobs:
   * Fetcher: Per-thread latency merged in mapper; single reducer merges 
TDigests and sets job-level p50/p95/p99.
   * ParseSegment:
     * Mapper emits latency digest under `LATENCY_KEY`
     * Custom partitioner sends `LATENCY_KEY` to partition 0 so one reducer 
merges all TDigests
     * Reducer merges and sets correct percentile counters.
   * Indexer:
     * Reducer writes TDigest to side output
     * IndexingJob runs *a new* “Indexer Latency Merge” job which merges the 
reducers' TDigests and sets percentile counters. On merge failure, only 
`LOG.error` and driver-level `ErrorTracker` categorization are run.
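
The ParseSegment routing rule described above can be sketched as follows. This is an assumption-laden illustration rather than the PR's code: the class name and the value of `LATENCY_KEY` are hypothetical, and plain strings are used for keys so the example runs without Hadoop. In a real job the same logic would sit in a subclass of Hadoop's `Partitioner`, inside `getPartition(key, value, numPartitions)`.

```java
// Hypothetical sketch, not the PR's code: the routing rule a custom
// Hadoop Partitioner could apply to latency records.
class LatencyPartitionSketch {

    // Assumed sentinel value; the PR only names the constant LATENCY_KEY.
    static final String LATENCY_KEY = "__latency__";

    // Mirrors the shape of Partitioner.getPartition(key, value, numPartitions),
    // with the value argument omitted because it is not needed for routing.
    static int getPartition(String key, int numPartitions) {
        if (LATENCY_KEY.equals(key)) {
            // Every latency digest goes to partition 0, so a single
            // reducer sees all of them and can merge them.
            return 0;
        }
        // Ordinary records keep default hash partitioning.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        System.out.println(getPartition(LATENCY_KEY, 8));
        System.out.println(getPartition("http://example.org/", 8));
    }
}
```

Pinning the latency key to one partition leaves all other records spread across the reducers as usual, so only the small digest records are concentrated on a single reducer.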
   
   I think this fixes the issues. Arguably it is more complex than logging to 
file and performing some ETL to extract metrics from the logs. However, this 
solution sticks with convention by keeping metrics within the Hadoop ecosystem.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

