[ 
https://issues.apache.org/jira/browse/FLINK-32127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17724124#comment-17724124
 ] 

Zhanghao Chen commented on FLINK-32127:
---------------------------------------

Hi [~Wencong Liu]. I think the key problem here is that busy time is 
well-defined for single-threaded computation, but ill-defined for 
multi-threaded computation. As pointed out in this 
[blog|https://flink.apache.org/2021/07/07/how-to-identify-the-source-of-backpressure/#what-are-those-numbers],
 {{busyTimeMsPerSecond}} and {{idleTimeMsPerSecond}} metrics are oblivious to 
anything that is happening in separate threads, outside of the main subtask’s 
execution loop. I'm not sure whether there is a way to at least partially 
solve this for sources, but maybe we should document it 
[here|https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/ops/monitoring/back_pressure/#task-performance-metrics]
 first.
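
To make the effect concrete, here is a rough steady-state model of the 
fetcher/consumer split described in the issue. It is only a sketch, not 
Flink's actual metric code: the class and method names are made up, the 
per-record costs (5 ms on the fetcher thread, 3 ms on the consumer thread) 
are hypothetical, and it assumes a single fetcher thread, a bounded queue, 
and no downstream backpressure.

```java
final class SourceBusyTimeSketch {

    // Records per second in steady state: the slower of the two stages
    // limits throughput because they are coupled by a bounded queue.
    static double recordsPerSecond(double fetchMsPerRecord, double consumeMsPerRecord) {
        if (fetchMsPerRecord <= 0 || consumeMsPerRecord <= 0) {
            throw new IllegalArgumentException("per-record costs must be positive");
        }
        return 1000.0 / Math.max(fetchMsPerRecord, consumeMsPerRecord);
    }

    // What a busy-time metric measured only on the consumer (main) thread
    // would report: time the consumer spends doing its own work.
    static long consumerBusyMsPerSecond(double fetchMsPerRecord, double consumeMsPerRecord) {
        double rate = recordsPerSecond(fetchMsPerRecord, consumeMsPerRecord);
        return Math.round(rate * consumeMsPerRecord);
    }

    // Time the fetcher thread spends working; invisible to the metric above.
    static long fetcherBusyMsPerSecond(double fetchMsPerRecord, double consumeMsPerRecord) {
        double rate = recordsPerSecond(fetchMsPerRecord, consumeMsPerRecord);
        return Math.round(rate * fetchMsPerRecord);
    }

    public static void main(String[] args) {
        // Hypothetical costs: fetch + decompress 5 ms, deserialize 3 ms.
        long busy = consumerBusyMsPerSecond(5.0, 3.0);
        long idle = 1000 - busy; // consumer blocked waiting on the queue
        long fetcherBusy = fetcherBusyMsPerSecond(5.0, 3.0);
        // prints: busy=600 idle=400 fetcherBusy=1000
        System.out.println("busy=" + busy + " idle=" + idle + " fetcherBusy=" + fetcherBusy);
    }
}
```

With these assumed numbers the consumer thread reports 600 ms of busy time 
per second, although the fetcher thread is saturated and the source cannot 
go any faster.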

> Source busy time is inaccurate in many cases
> --------------------------------------------
>
>                 Key: FLINK-32127
>                 URL: https://issues.apache.org/jira/browse/FLINK-32127
>             Project: Flink
>          Issue Type: Improvement
>          Components: Autoscaler
>            Reporter: Zhanghao Chen
>            Priority: Major
>
> We found that source busy time is inaccurate in many cases. The reason is 
> that sources are usually multi-threaded (Kafka and RocketMQ, for example): 
> a fetcher thread fetches data from the data source, and a consumer thread 
> deserializes it, with a blocking queue in between. A source is considered 
>  # *idle* if the consumer is blocked fetching data from the queue
>  # *backpressured* if the consumer is blocked writing data to downstream 
> operators
>  # *busy* otherwise
> However, this means that if the bottleneck is on the fetcher side, the 
> consumer will often be blocked fetching data from the queue and the source 
> idle time will be high, even though the source is in fact busy and consumes 
> a lot of CPU. In some of our jobs, the source max busy time is only ~600 ms 
> per second while the source has actually reached its limit.
> The bottleneck could be on the fetcher side when, for example, Kafka enables 
> zstd compression: decompression on the Kafka consumer (fetcher) side can be 
> quite heavy compared to data deserialization on the consumer thread.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
