Hi, Siva.

Additionally, you can temporarily set the Flink configuration 
`pipeline.operator-chaining: false` to disable operator chaining. With the 
operators unchained, you can see whether one specific operator is 
particularly busy, which would backpressure everything upstream of it.
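
For example (a sketch; the configuration file name depends on your Flink version and setup, and the SQL client form only applies if you run the job through Flink SQL):

    # in the Flink configuration file
    pipeline.operator-chaining: false

    -- or per session in the SQL client
    SET 'pipeline.operator-chaining' = 'false';

Remember to revert this once you have found the busy operator, since chaining exists to avoid serialization overhead between operators.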

--

    Best!
    Xuyang

At 2025-06-24 11:00:49, "Matt Cuento" <cuentom...@gmail.com> wrote:
>Hi Siva,
>
>Unfortunately the picture attached does not render for me. Would you be
>able to send the output of what an `EXPLAIN` statement reveals as your
>logical plan? This is a good first step to getting an idea of what each
>operator is doing.
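>
>For example, if the job is defined in Flink SQL (a sketch; the sink and
>source names below are hypothetical placeholders for whatever your job
>actually uses):
>
>    EXPLAIN PLAN FOR
>    INSERT INTO my_sql_sink
>    SELECT ... FROM my_kafka_source;
>
>The output shows the optimized physical plan, i.e. the operators you see
>in the web UI.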
>
>Degraded performance over time/after processing many records sounds like it
>could be a state size issue. Knowing what operators are used may help us
>know how much state is being stored. There are metrics around state size
>[1] as well as system resources [2].
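>
>As an aside, the host-level CPU/memory metrics are off by default; they
>can be enabled with the configuration below (this also requires the
>optional oshi dependency on the classpath, as described in [1]):
>
>    metrics.system-resource: true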
>
>Matt Cuento
>cuentom...@gmail.com
>
>[1]
>https://nightlies.apache.org/flink/flink-docs-master/docs/ops/metrics/#state-size
>[2]
>https://nightlies.apache.org/flink/flink-docs-master/docs/ops/metrics/#system-resources
>
>
>On Mon, Jun 23, 2025 at 4:25 PM Siva Ram Gujju <sivaram2...@gmail.com>
>wrote:
>
>> Hello,
>> I'm new to Flink and doing a POC. I have a Flink job which reads events
>> from a Kafka source topic, performs some calculations, and writes to a
>> couple of SQL sinks.
>> I deployed this to a standalone cluster running on my Linux virtual
>> machine (all default settings).
>>
>> Parallelism=3
>> NoOfTaskSlots allowed in config.yml=10
>> NoOfTaskSlots required for my job=3
>> Rest of the settings are default.
>>
>> The job runs fine for the first 100,000 events and the response is near
>> real time. After that, the first operator of the job starts to show Busy
>> (max): 100% and processing slows down significantly (see the picture below).
>> Heap is at 50%.
>> Source Lag (kafka consumers lag) is 0. Source Kafka cluster CPU is <3%.
>>
>>
>> 1. How can I triage what is causing the slowness? Is it a CPU or memory
>> issue, and how do I find out? Everything looks normal to me, and there are
>> no exceptions in the logs.
>> 2. Why did the job run fine and fast for the first 100K events and then
>> start slowing down? Any theories on this?
>> Please suggest. Thank you!
>>
>>
>> [image: Picture 1, Picture]
>>
