[ 
https://issues.apache.org/jira/browse/KAFKA-15792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Pang updated KAFKA-15792:
---------------------------------
    Description: 
Our Kafka Streams process is often slow to process a particular partition on a 
specific instance. No data skew is detected, i.e. partitions are mostly 
uniformly distributed. The symptom is a huge lag on a specific partition. We 
also notice that outbound network traffic on the problematic process is higher 
than on a normal process, often about 3x.

After restarting the process, the lag drains within 5 minutes of startup. 
This hints at an internal processing issue in our Streams application rather 
than a cluster problem or a poison message. 

Are there any metrics you suggest we look at, or is this a known issue? 
Regularly bouncing the application doesn't look like a proper fix for 
production systems.

  was:
Our Kafka Streams process is often slow to process a particular partition on a 
specific instance. No data skew is detected, i.e. partitions are mostly 
uniformly distributed. The symptom is a huge lag on a specific partition.

After restarting the process, the lag drains within 5 minutes of startup. 
This hints at an internal processing issue in our Streams application rather 
than a cluster problem or a poison message.

Are there any metrics you suggest we look at, or is this a known issue? 
Regularly bouncing the application doesn't look like a proper fix for 
production systems.


> Kafka Streams stuck partition fixed after restarting the process
> ----------------------------------------------------------------
>
>                 Key: KAFKA-15792
>                 URL: https://issues.apache.org/jira/browse/KAFKA-15792
>             Project: Kafka
>          Issue Type: New Feature
>          Components: streams
>    Affects Versions: 3.1.2
>            Reporter: Patrick Pang
>            Priority: Major
>
> Our Kafka Streams process is often slow to process a particular partition on 
> a specific instance. No data skew is detected, i.e. partitions are mostly 
> uniformly distributed. The symptom is a huge lag on a specific partition. We 
> also notice that outbound network traffic on the problematic process is 
> higher than on a normal process, often about 3x.
> After restarting the process, the lag drains within 5 minutes of startup. 
> This hints at an internal processing issue in our Streams application rather 
> than a cluster problem or a poison message. 
> Are there any metrics you suggest we look at, or is this a known issue? 
> Regularly bouncing the application doesn't look like a proper fix for 
> production systems.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
