xichen01 commented on PR #6776:
URL: https://github.com/apache/ozone/pull/6776#issuecomment-2154543779

   @kerneltime Thanks for your response.
   
   > Why do we need to filter the metrics for top 10 here and add a lot of 
processing in the data path? Grafana can sort and showcase top 10. Why don't we 
just add the datanode label for each latency measured from the client. This 
way, we can compare client side view of latency on a per datanode basis even if 
they are below a certain threshold? It might be over all simpler to capture all 
the data in the client and let grafana sort through and present what we want 
instead of adding the threshold logic here.
   
   - A cluster may have a lot of Datanodes. If we export metrics for all 
Datanodes, this will generate a very large number of metrics. Exporting just the 
`opsLatency` metrics for `GetBlock`, `PutBlock`, `WriteChunk` and `ReadChunk` 
may already generate thousands of metrics, and most of them are probably of no concern.
   
   - TopN metrics only record the number of operations that take longer than a 
threshold to execute. These counts are more useful and make it easier to detect 
long-tail problems. If only the average value is used, the long-tail latency 
is averaged out, and long-tail latency problems become difficult to detect 
from the metrics.
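
To make the two points above concrete, here is a minimal sketch of the idea, not the code in this PR. The class name `SlowOpTopN`, its methods, and the threshold and N values are all hypothetical. It counts, per Datanode, only the operations whose latency exceeds a threshold, and reports the N Datanodes with the most slow operations, so the number of exported values is bounded by N rather than by the cluster size.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical illustration only; not the class added by the PR. */
public class SlowOpTopN {
  private final long thresholdMs;
  private final int topN;
  // Slow-operation count per Datanode; entries exist only for Datanodes
  // that have exceeded the threshold at least once.
  private final Map<String, LongAdder> slowOps = new ConcurrentHashMap<>();

  public SlowOpTopN(long thresholdMs, int topN) {
    this.thresholdMs = thresholdMs;
    this.topN = topN;
  }

  /** Counts the operation only if it took longer than the threshold. */
  public void record(String datanode, long latencyMs) {
    if (latencyMs > thresholdMs) {
      slowOps.computeIfAbsent(datanode, k -> new LongAdder()).increment();
    }
  }

  /** Returns at most topN (datanode, slowCount) pairs, highest count first. */
  public List<Map.Entry<String, Long>> top() {
    List<Map.Entry<String, Long>> result = new ArrayList<>();
    slowOps.forEach((dn, count) -> result.add(Map.entry(dn, count.sum())));
    result.sort(Map.Entry.<String, Long>comparingByValue().reversed());
    return new ArrayList<>(result.subList(0, Math.min(topN, result.size())));
  }
}
```

As an illustration with made-up numbers: if 999 operations take 2 ms and one takes 2000 ms, the average is about 4 ms, which looks healthy, while a threshold-based counter like the one sketched here would still report one slow operation for that Datanode.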
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

