[jira] [Updated] (FLINK-32127) Source busy time is inaccurate in many cases

Zhanghao Chen (Jira) Wed, 17 May 2023 23:21:05 -0700


     [ 
https://issues.apache.org/jira/browse/FLINK-32127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Zhanghao Chen updated FLINK-32127:
----------------------------------
    Description: 
We found that source busy time is inaccurate in many cases. The reason is that 
sources are usu. multi-threaded (Kafka and RocketMq for example), there is a 
fetcher thread fetching data from data source, and a consumer thread 
deserializes data with an blocking queue in between. A source is considered 
 # *idle* if the consumer is blocked by fetching data from the queue
 # *backpressured* if the consumer is blocked by writing data to downstream 
operators
 # *busy* otherwise

However, this means that if the bottleneck is on the fetcher side, the consumer 
will be often blocked by fetching data from the queue, the source idle time 
would be high, but in fact it is busy and consumes a lot of CPU. In some of our 
jobs, the source max busy time is only ~600 ms while it has actually reached 
the limit.

The bottleneck could be on the fetcher side, for example, when Kafka enables 
zstd compression, uncompression on the consumer side could be quite heavy 
compared to data deserialization on the consumer thread side.

  was:
We found that source busy time is inaccurate in many cases. The reason is that 
sources are usu. multi-threaded (Kafka and RocketMq for example), there is a 
fetcher thread fetching data from data source, and a consumer thread 
deserializes data with an blocking queue in between. A source is considered 
 # *idle* if the consumer is blocked by fetching data from the queue
 # *backpressured* if the consumer is blocked by writing data to downstream 
operators
 # *busy* otherwise

However, this means that if the bottleneck is on the fetcher side, the consumer 
will be often blocked by fetching data from the queue, the source idle time 
would be high, but in fact it is busy and consumes a lot of CPU. In some of our 
jobs, the source max busy time is only ~600 ms while it has actually reached 
the limit.


> Source busy time is inaccurate in many cases
> --------------------------------------------
>
>                 Key: FLINK-32127
>                 URL: https://issues.apache.org/jira/browse/FLINK-32127
>             Project: Flink
>          Issue Type: Improvement
>          Components: Autoscaler
>            Reporter: Zhanghao Chen
>            Priority: Major
>
> We found that source busy time is inaccurate in many cases. The reason is 
> that sources are usu. multi-threaded (Kafka and RocketMq for example), there 
> is a fetcher thread fetching data from data source, and a consumer thread 
> deserializes data with an blocking queue in between. A source is considered 
>  # *idle* if the consumer is blocked by fetching data from the queue
>  # *backpressured* if the consumer is blocked by writing data to downstream 
> operators
>  # *busy* otherwise
> However, this means that if the bottleneck is on the fetcher side, the 
> consumer will be often blocked by fetching data from the queue, the source 
> idle time would be high, but in fact it is busy and consumes a lot of CPU. In 
> some of our jobs, the source max busy time is only ~600 ms while it has 
> actually reached the limit.
> The bottleneck could be on the fetcher side, for example, when Kafka enables 
> zstd compression, uncompression on the consumer side could be quite heavy 
> compared to data deserialization on the consumer thread side.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Updated] (FLINK-32127) Source busy time is inaccurate in many cases

Reply via email to