[ 
https://issues.apache.org/jira/browse/NUTCH-3162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18063563#comment-18063563
 ] 

Lewis John McGibbney commented on NUTCH-3162:
---------------------------------------------

My understanding of the current problems (I have only looked at Fetcher so far) 
is as follows:
 # {*}Multi-thread{*}: each _*FetcherThread*_ has its own _*LatencyTracker*_ 
and calls _*emitCounters(context)*_ in its _*finally*_ block. As you mention, 
Seb, _*emitCounters*_ uses _*setValue()*_ for all counters, so only the values 
from the last thread to finish are visible; the other threads’ data is lost.
 # {*}Multi-task{*}: Hadoop aggregates task counters to job level by summing 
them. So _*count_total*_ and _*sum_ms*_ are correct at the job level, but 
_*p50_ms*_, _*p95_ms*_ and _*p99_ms*_ are also summed across tasks, which is 
invalid (e.g. _*p50_task1*_ + _*p50_task2*_ + …). This confirms your 
observations, Seb.
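
To make the multi-task problem concrete, here is a minimal sketch in plain Java. It is not Nutch code: the class name, the helper methods and the sample latencies are all made up for illustration, and the nearest-rank percentile is only a stand-in for whatever LatencyTracker computes internally.

```java
import java.util.Arrays;

/** Illustration only: summing per-task percentiles vs. the percentile of all samples. */
final class PercentileSumDemo {

    /** Nearest-rank percentile of the samples, for q in (0, 1]. */
    static long percentile(long[] samples, double q) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(q * sorted.length); // 1-based rank
        return sorted[Math.max(rank, 1) - 1];
    }

    /** Concatenates the samples of two tasks into one array. */
    static long[] concat(long[] a, long[] b) {
        long[] all = Arrays.copyOf(a, a.length + b.length);
        System.arraycopy(b, 0, all, a.length, b.length);
        return all;
    }

    public static void main(String[] args) {
        long[] task1 = {10, 20, 30, 40, 50};      // hypothetical latencies in ms
        long[] task2 = {100, 200, 300, 400, 500}; // hypothetical latencies in ms

        // What summing the per-task p50 counters produces at the job level.
        long summed = percentile(task1, 0.5) + percentile(task2, 0.5);
        // What the job-level p50 should be: the median of all samples.
        long actual = percentile(concat(task1, task2), 0.5);

        System.out.println("sum of per-task p50 = " + summed);
        System.out.println("p50 of all samples  = " + actual);
    }
}
```

With these sample values the summed counter is 330 ms while the median of the combined data is 50 ms, and the gap grows with the number of tasks.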

The solution could be to 
 # {*}Thread merge{*}: Merge all thread LatencyTrackers into one and emit once 
from the mapper after all threads finish. Again I've only looked at Fetcher for 
now. We can use TDigest’s merge support (an implementation is already available 
in 
[CrawlDbReader|https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/CrawlDbReader.java#L88])
 so percentiles are correct for the combined thread data.

 # {*}Task merge{*}: Keep count/sum as-is (they already aggregate correctly). 
For percentiles across tasks, stop relying on job-level p50/p95/p99 counters 
(which are summed). Instead, emit histogram bucket counters from each task; at 
job level these sum to the full distribution. Add a post-job step (or utility) 
that reads the histogram counters and computes approximate p50/p95/p99 for the 
whole job.
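
A rough sketch of the task-merge idea in plain Java follows. Everything here is an assumption for illustration: the class name, the bucket boundaries and the use of a long[] in place of real Hadoop counters. Summing the arrays element-wise stands in for Hadoop's job-level counter aggregation.

```java
/** Illustration only: per-task histogram buckets that remain valid when summed. */
final class HistogramPercentiles {

    /** Hypothetical bucket upper bounds in ms; the last bucket is open-ended. */
    static final long[] UPPER_BOUNDS_MS = {10, 50, 100, 500, 1000, 5000, Long.MAX_VALUE};

    /** Index of the first bucket whose upper bound is >= latencyMs. */
    static int bucketFor(long latencyMs) {
        for (int i = 0; i < UPPER_BOUNDS_MS.length; i++) {
            if (latencyMs <= UPPER_BOUNDS_MS[i]) {
                return i;
            }
        }
        return UPPER_BOUNDS_MS.length - 1;
    }

    /** What one task would emit: one count per bucket. */
    static long[] taskHistogram(long[] latenciesMs) {
        long[] counts = new long[UPPER_BOUNDS_MS.length];
        for (long latency : latenciesMs) {
            counts[bucketFor(latency)]++;
        }
        return counts;
    }

    /** Stand-in for job-level aggregation: bucket counts are summed element-wise. */
    static long[] sum(long[] a, long[] b) {
        long[] out = new long[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = a[i] + b[i];
        }
        return out;
    }

    /**
     * Approximate percentile: the upper bound of the bucket holding the
     * nearest-rank sample. Returns Long.MAX_VALUE if that sample falls
     * into the open-ended last bucket, and 0 for an empty histogram.
     */
    static long approxPercentileMs(long[] counts, double q) {
        long total = 0;
        for (long c : counts) {
            total += c;
        }
        if (total == 0) {
            return 0;
        }
        long rank = Math.max(1, (long) Math.ceil(q * total));
        long seen = 0;
        for (int i = 0; i < counts.length; i++) {
            seen += counts[i];
            if (seen >= rank) {
                return UPPER_BOUNDS_MS[i];
            }
        }
        return UPPER_BOUNDS_MS[counts.length - 1];
    }

    public static void main(String[] args) {
        long[] job = sum(taskHistogram(new long[] {5, 20, 30, 80, 90}),
                         taskHistogram(new long[] {200, 300, 700, 2000, 6000}));
        System.out.println("approx p50 <= " + approxPercentileMs(job, 0.5) + " ms");
    }
}
```

The precision of the result is limited by the bucket width, so the boundaries would need to be chosen to match the latencies we actually expect. The same structure would apply to the post-job step, reading the summed bucket counters instead of an in-memory array.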

I'll continue investigating and propose a PR at some point.

> Latency metrics to properly merge data from all threads and tasks
> -----------------------------------------------------------------
>
>                 Key: NUTCH-3162
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3162
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher, indexer, parser
>    Affects Versions: 1.22
>            Reporter: Sebastian Nagel
>            Assignee: Lewis John McGibbney
>            Priority: Major
>             Fix For: 1.23
>
>
> The latency metrics (NUTCH-3134) have two issues:
> 1. Only the data from one thread is used if a tool is multi-threaded. 
> That's definitely the case for Fetcher. The "emitCounters" method needs to 
> increment the counter values, instead of calling "setValue". However, this is 
> not the correct approach for the percentiles, see also next point.
> 2. If running in full cluster mode with multiple parallel tasks, the task 
> counters are summed up to the job counter value. However, the values of the 
> latency percentiles then turn out to be too high.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
