[
https://issues.apache.org/jira/browse/NUTCH-3162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18063563#comment-18063563
]
Lewis John McGibbney commented on NUTCH-3162:
---------------------------------------------
My understanding of the current problems (I only looked at Fetcher for now) is
as follows:
# {*}Multi-thread{*}: each _*FetcherThread*_ has its own _*LatencyTracker*_
and calls _*emitCounters(context)*_ in its _*finally*_ block. As you mention,
Seb, _*emitCounters*_ uses _*setValue()*_ for all counters, so only the data
from the last thread to finish is visible; other threads’ data is lost.
# {*}Multi-task{*}: Hadoop aggregates task counters to job level by summing
them. So _*count_total*_ and _*sum_ms*_ are correct at the job level, but
{_}*p50_ms*{_}, _*p95_ms*_ and _*p99_ms*_ are summed across tasks, which is
invalid (e.g. _*p50_task1*_ + _*p50_task2*_ + …). This confirms your
observations, Seb.
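A small illustration of the second point, using made-up latencies for two hypothetical tasks (the class and method names are for illustration only and are not part of Nutch): summing the per-task medians gives a value that has nothing to do with the median of the combined data.

```java
import java.util.Arrays;

class SummedPercentileDemo {

    /** Nearest-rank percentile of an unsorted sample; q is in (0, 1]. */
    static long percentile(long[] values, double q) {
        long[] sorted = values.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(q * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** Concatenates two samples, i.e. what a correct merge would look at. */
    static long[] concat(long[] a, long[] b) {
        long[] all = new long[a.length + b.length];
        System.arraycopy(a, 0, all, 0, a.length);
        System.arraycopy(b, 0, all, a.length, b.length);
        return all;
    }

    public static void main(String[] args) {
        // Made-up latencies in milliseconds for two tasks.
        long[] task1 = {10, 20, 30, 40, 50};
        long[] task2 = {100, 200, 300, 400, 500};

        // What the job-level counter shows when task counters are summed.
        long summed = percentile(task1, 0.5) + percentile(task2, 0.5);

        // The median of the data from both tasks taken together.
        long combined = percentile(concat(task1, task2), 0.5);

        System.out.println("sum of per-task p50 = " + summed);   // 330
        System.out.println("p50 of combined data = " + combined); // 50
    }
}
```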
The solution could be to
# {*}Thread merge{*}: Merge all thread LatencyTrackers into one and emit once
from the mapper after all threads finish. Again, I've only looked at Fetcher for
now. We can use TDigest’s merge support (an implementation is already available
in
[CrawlDbReader|https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/CrawlDbReader.java#L88])
so percentiles are correct for the combined thread data.
# {*}Task merge{*}: Keep count/sum as-is (they already aggregate correctly).
For percentiles across tasks, stop relying on job-level p50/p95/p99 counters
(which are summed). Instead, emit histogram bucket counters from each task; at
job level these sum to the full distribution. Add a post-job step (or utility)
that reads the histogram counters and computes approximate p50/p95/p99 for the
whole job.
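A minimal sketch of the task-merge idea, in plain Java without Hadoop or TDigest. The bucket boundaries, class name and method names are assumptions made for illustration, not existing Nutch code. Each task would emit one counter per bucket; because Hadoop sums counters across tasks, the job-level bucket counts form the histogram of the whole job, from which approximate percentiles can be read.

```java
class LatencyHistogramSketch {

    // Hypothetical inclusive upper bucket bounds in milliseconds;
    // the last bucket is open-ended.
    static final long[] UPPER_BOUNDS_MS =
        {10, 50, 100, 500, 1000, 5000, Long.MAX_VALUE};

    /** Index of the first bucket whose upper bound is >= latencyMs. */
    static int bucketIndex(long latencyMs) {
        for (int i = 0; i < UPPER_BOUNDS_MS.length; i++) {
            if (latencyMs <= UPPER_BOUNDS_MS[i]) {
                return i;
            }
        }
        return UPPER_BOUNDS_MS.length - 1;
    }

    /** Per-task bucket counts; each entry would be emitted as one counter. */
    static long[] histogram(long... latenciesMs) {
        long[] counts = new long[UPPER_BOUNDS_MS.length];
        for (long latency : latenciesMs) {
            counts[bucketIndex(latency)]++;
        }
        return counts;
    }

    /** Element-wise sum, mimicking how task counters are summed to job level. */
    static long[] sum(long[] a, long[] b) {
        long[] out = new long[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = a[i] + b[i];
        }
        return out;
    }

    /** Upper bound of the bucket that contains the nearest-rank q-quantile. */
    static long approxPercentileMs(long[] counts, double q) {
        long total = 0;
        for (long c : counts) {
            total += c;
        }
        if (total == 0) {
            return 0;
        }
        long rank = Math.max(1, (long) Math.ceil(q * total));
        long seen = 0;
        for (int i = 0; i < counts.length; i++) {
            seen += counts[i];
            if (seen >= rank) {
                return UPPER_BOUNDS_MS[i];
            }
        }
        return UPPER_BOUNDS_MS[counts.length - 1];
    }

    public static void main(String[] args) {
        long[] task1 = histogram(10, 20, 30, 40, 50);
        long[] task2 = histogram(100, 200, 300, 400, 500);
        long[] job = sum(task1, task2);
        System.out.println("approx p50 = " + approxPercentileMs(job, 0.50) + " ms");
        System.out.println("approx p95 = " + approxPercentileMs(job, 0.95) + " ms");
    }
}
```

The result is only as precise as the bucket boundaries: the sketch reports the upper bound of the bucket containing the requested rank, so finer buckets give tighter estimates at the cost of more counters per task.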
I'll continue investigating and propose a PR at some point.
> Latency metrics to properly merge data from all threads and tasks
> -----------------------------------------------------------------
>
> Key: NUTCH-3162
> URL: https://issues.apache.org/jira/browse/NUTCH-3162
> Project: Nutch
> Issue Type: Bug
> Components: fetcher, indexer, parser
> Affects Versions: 1.22
> Reporter: Sebastian Nagel
> Assignee: Lewis John McGibbney
> Priority: Major
> Fix For: 1.23
>
>
> The latency metrics (NUTCH-3134) have two issues:
> 1. Only the data from one thread is used if a tool is multi-threaded.
> That's definitely the case for Fetcher. The "emitCounters" method needs to
> increment the counter values instead of calling "setValue". However, this is
> not the correct approach for the percentiles, see also next point.
> 2. If running in full cluster mode with multiple parallel tasks, the task
> counters are summed up to the job counter value. However, the values of the
> latency percentiles then turn out to be too high.