[ 
https://issues.apache.org/jira/browse/NUTCH-3162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18065082#comment-18065082
 ] 

ASF GitHub Bot commented on NUTCH-3162:
---------------------------------------

lewismc opened a new pull request, #906:
URL: https://github.com/apache/nutch/pull/906

   PR for [NUTCH-3162](https://issues.apache.org/jira/browse/NUTCH-3162) which 
addresses shortcomings in job-level latency percentiles (p50, p95, p99) for 
Fetcher, ParseSegment, and Indexer by merging TDigest data from all map tasks 
and threads and writing counters in a single reducer (or a dedicated merge job 
for Indexer). It should fix the cases where per-task counters were summed and 
percentiles were not merged.
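
To illustrate why summed per-task percentile counters are misleading, here is a minimal, self-contained sketch. It is not code from the patch: the class and method names are hypothetical, and it merges raw samples with a nearest-rank percentile, whereas the PR merges TDigest sketches. The point is only the difference between "sum of per-task percentiles" and "percentile of the merged data".

```java
import java.util.Arrays;

// Hypothetical illustration only; not part of the Nutch patch.
class LatencyPercentileSketch {

  /** Nearest-rank percentile of a non-empty sample, with q in (0, 1]. */
  static long percentile(long[] samples, double q) {
    long[] sorted = samples.clone();
    Arrays.sort(sorted);
    int rank = (int) Math.ceil(q * sorted.length);
    return sorted[Math.max(rank, 1) - 1];
  }

  /** What a summed job counter reports: the sum of each task's percentile. */
  static long summedPercentile(long[][] tasks, double q) {
    long sum = 0;
    for (long[] task : tasks) {
      sum += percentile(task, q);
    }
    return sum;
  }

  /** Merge the data from all tasks first, then compute the percentile once. */
  static long mergedPercentile(long[][] tasks, double q) {
    int total = 0;
    for (long[] task : tasks) {
      total += task.length;
    }
    long[] all = new long[total];
    int offset = 0;
    for (long[] task : tasks) {
      System.arraycopy(task, 0, all, offset, task.length);
      offset += task.length;
    }
    return percentile(all, q);
  }

  public static void main(String[] args) {
    // Made-up latencies in milliseconds for three map tasks.
    long[][] tasks = {
      {100, 110, 120, 130},
      {90, 95, 105, 400},
      {80, 100, 115, 125}
    };
    System.out.println("summed p50 = " + summedPercentile(tasks, 0.5));
    System.out.println("merged p50 = " + mergedPercentile(tasks, 0.5));
  }
}
```

With these made-up numbers the summed p50 is 305 ms while the p50 of the merged data is 105 ms, which is the "too high" effect described in the issue.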
   
   This patch touches the following jobs:
   * Fetcher: Per-thread latency merged in mapper; single reducer merges 
TDigests and sets job-level p50/p95/p99.
   * ParseSegment:
     * Mapper emits latency digest under `LATENCY_KEY`
     * Custom partitioner sends `LATENCY_KEY` to partition 0 so one reducer 
merges all TDigests
     * Reducer merges and sets correct percentile counters.
   * Indexer:
     * Reducer writes TDigest to side output
     * IndexingJob runs a new “Indexer Latency Merge” job which merges the 
reducers' TDigests and sets percentile counters. On merge failure, only 
`LOG.error` and driver-level `ErrorTracker` categorization are run.
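
The ParseSegment routing described above can be sketched roughly as follows. This is a stdlib-only illustration, not the partitioner from the patch: the class name and the value of `LATENCY_KEY` are assumptions, and a real implementation would extend Hadoop's `Partitioner` rather than expose a static method. The fallback mirrors the usual hash partitioning formula.

```java
// Hypothetical illustration only; not part of the Nutch patch.
class LatencyPartitionSketch {

  // Assumed placeholder value; the real key is defined in the patch.
  static final String LATENCY_KEY = "__latency__";

  /**
   * Sends the latency digest key to partition 0 so that a single reducer
   * sees every digest; all other keys are spread by hash.
   */
  static int getPartition(String key, int numPartitions) {
    if (LATENCY_KEY.equals(key)) {
      return 0;
    }
    // Mask the sign bit so the modulo result is never negative.
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }

  public static void main(String[] args) {
    System.out.println(getPartition(LATENCY_KEY, 8));
    System.out.println(getPartition("https://example.org/", 8));
  }
}
```

Because every mapper emits its digest under the same key and that key always maps to partition 0, exactly one reducer receives all digests and can set the job-level counters once.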
   
   I think this fixes the issues. Arguably it is more complex than logging to 
file and performing some ETL to extract metrics from the logs. However, this 
solution does stick with convention by keeping metrics within the Hadoop 
ecosystem.
   
   




> Latency metrics to properly merge data from all threads and tasks
> -----------------------------------------------------------------
>
>                 Key: NUTCH-3162
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3162
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher, indexer, parser
>    Affects Versions: 1.22
>            Reporter: Sebastian Nagel
>            Assignee: Lewis John McGibbney
>            Priority: Major
>             Fix For: 1.23
>
>
> The latency metrics (NUTCH-3134) have two issues:
> 1. Only the data from one thread is used in case a tool is multi-threaded. 
> That's definitely the case for Fetcher. The "emitCounters" methods need to 
> increment the counter values, instead of calling "setValue". However, this is 
> not the correct approach for the percentiles, see also next point.
> 2. If running in full cluster mode with multiple parallel tasks, the task 
> counters are summed up to the job counter value. However, the values of the 
> latency percentiles then turn out to be too high.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
