[ 
https://issues.apache.org/jira/browse/NUTCH-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18044679#comment-18044679
 ] 

ASF GitHub Bot commented on NUTCH-3134:
---------------------------------------

lewismc opened a new pull request, #876:
URL: https://github.com/apache/nutch/pull/876

   PR for [NUTCH-3134](https://issues.apache.org/jira/browse/NUTCH-3134). 
Notably, this PR introduces a new class, `LatencyTracker`, which tracks 
latency metrics. The implementation wraps the TDigest data structure to 
collect latency samples and emit Hadoop counters with count, sum, and 
percentile values (p50, p95, p99). This is currently limited to the Fetcher, 
Parser and Indexer jobs but could be extended to other jobs in the future.
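
   As a rough illustration of the idea (a sketch, not the code in this PR), 
the snippet below records latency samples and reports count, sum and 
p50/p95/p99. Everything in it is an assumption made for illustration: the 
class name `LatencySketch`, the counter names such as `fetcher_latency_p95`, 
and the use of milliseconds. To stay dependency-free it computes exact 
nearest-rank percentiles over a sorted list where the PR uses TDigest, and it 
returns a plain `Map` where the PR emits Hadoop counters.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the PR's LatencyTracker; not the actual Nutch class.
class LatencySketch {
  private final String prefix;                        // e.g. "fetcher" (assumed naming)
  private final List<Long> samplesMs = new ArrayList<>();
  private long sumMs = 0;

  LatencySketch(String prefix) {
    this.prefix = prefix;
  }

  /** Records one latency sample, in milliseconds. */
  void record(long latencyMs) {
    samplesMs.add(latencyMs);
    sumMs += latencyMs;
  }

  /**
   * Exact nearest-rank percentile for p in 1..100. This keeps every sample in
   * memory; TDigest approximates the same quantity in bounded memory.
   */
  long percentile(int p) {
    int n = samplesMs.size();
    if (n == 0) {
      return 0L;
    }
    List<Long> sorted = new ArrayList<>(samplesMs);
    Collections.sort(sorted);
    // Integer ceiling of p * n / 100, clamped to at least rank 1.
    int rank = (int) (((long) p * n + 99) / 100);
    return sorted.get(Math.max(rank, 1) - 1);
  }

  /** Returns the values that would be written out as counters. */
  Map<String, Long> snapshot() {
    Map<String, Long> counters = new LinkedHashMap<>();
    counters.put(prefix + "_latency_count", (long) samplesMs.size());
    counters.put(prefix + "_latency_sum_ms", sumMs);
    counters.put(prefix + "_latency_p50_ms", percentile(50));
    counters.put(prefix + "_latency_p95_ms", percentile(95));
    counters.put(prefix + "_latency_p99_ms", percentile(99));
    return counters;
  }

  public static void main(String[] args) {
    LatencySketch tracker = new LatencySketch("fetcher");

    // Time one unit of work: start before it, stop in finally so failures count too.
    long start = System.nanoTime();
    try {
      Math.sqrt(42.0); // placeholder for the fetch, parse or index call
    } finally {
      tracker.record((System.nanoTime() - start) / 1_000_000);
    }

    // Print the counters once, as a job would at the end of a task.
    tracker.snapshot().forEach((name, value) -> System.out.println(name + "=" + value));
  }
}
```

   The `try`/`finally` in `main` shows one way to place the start and stop 
boundaries so that failed operations are still timed; whether failures should 
be included in the distribution is a design choice, not something this sketch 
settles.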
   
   One note for reviewers: please sanity-check that
   
   1. the latency start and stop boundaries are accurate.
   2. the counters are emitted at the correct times.
   
   Thanks for any review. Local testing looks good so far. My next step will 
be to share my WIP Nutch observability solution via the user@ mailing list.

> Add latency metrics with percentile support to Fetcher, Parser, and Indexer
> ---------------------------------------------------------------------------
>
>                 Key: NUTCH-3134
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3134
>             Project: Nutch
>          Issue Type: Sub-task
>          Components: fetcher, indexer, parser
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Major
>             Fix For: 1.22
>
>
> This task involves adding timing metrics to the fetching, parsing and 
> indexing jobs. We could likely expand this to other jobs in the future but 
> this is a good start. The timing metrics should come with percentile support 
> using TDigest (https://github.com/tdunning/t-digest), which Nutch already 
> depends on. This would enable tracking fetch latency, parse latency, and 
> indexing latency with p50/p95/p99 insights exposed via Hadoop counters.
> Latency distributions will be useful for:
>  * Identifying performance bottlenecks in crawl jobs
>  * Tuning fetch/parse/index configurations 
>  * Detecting anomalies in processing times



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
