Hi Siva. Additionally, you can temporarily set the Flink configuration `pipeline.operator-chaining: false` to unchain all operators. This lets you see whether one specific operator is particularly busy and is therefore the one back-pressuring everything upstream of it.
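For example, assuming the job is submitted through the SQL client (the exact submission path isn't clear from the thread, so this is only a sketch), the option can be set for just that session:

    -- Disable operator chaining for this session only, so every operator
    -- shows up separately in the web UI with its own Busy / Backpressure %.
    SET 'pipeline.operator-chaining' = 'false';

If the job is a DataStream program instead, `env.disableOperatorChaining()` achieves the same thing, and the cluster-wide equivalent is adding `pipeline.operator-chaining: false` to the configuration file and restarting the cluster.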
--
    Best!
    Xuyang

At 2025-06-24 11:00:49, "Matt Cuento" <cuentom...@gmail.com> wrote:

>Hi Siva,
>
>Unfortunately the picture attached does not render for me. Would you be
>able to send the output of what an `EXPLAIN` statement reveals as your
>logical plan? This is a good first step toward getting an idea of what each
>operator is doing.
>
>Degraded performance over time, or after processing many records, sounds
>like it could be a state size issue. Knowing which operators are used may
>help us estimate how much state is being stored. There are metrics around
>state size [1] as well as system resources [2].
>
>Matt Cuento
>cuentom...@gmail.com
>
>[1]
>https://nightlies.apache.org/flink/flink-docs-master/docs/ops/metrics/#state-size
>[2]
>https://nightlies.apache.org/flink/flink-docs-master/docs/ops/metrics/#system-resources
>
>
>On Mon, Jun 23, 2025 at 4:25 PM Siva Ram Gujju <sivaram2...@gmail.com>
>wrote:
>
>> Hello,
>> I'm new to Flink and doing a POC. I have a Flink job which reads events
>> from a Kafka source topic, performs some calculations, and writes to a
>> couple of SQL sinks.
>> I deployed this to a standalone cluster running on my Linux virtual
>> machine (all default settings).
>>
>> Parallelism = 3
>> NoOfTaskSlots allowed in config.yml = 10
>> NoOfTaskSlots required for my job = 3
>> Rest of the settings are default.
>>
>> The job runs fine for the first 100,000 events and the response is near
>> real time. After that, the first operator of the job starts to show Busy
>> (max): 100% and the processing slows down significantly (see the picture
>> below). Heap is at 50%.
>> Source lag (Kafka consumer lag) is 0. Source Kafka cluster CPU is <3%.
>>
>> 1. How can I triage what is causing the slowness? Is it a CPU or memory
>> issue, and how do I find out? Everything looks normal to me. No exceptions
>> in the logs.
>> 2. Why did the job run fine for the first 100K events, super fast, and
>> then start slowing down? Any theory on this?
>> Please suggest. Thank you!
>>
>> [image: Picture 1, Picture]
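For reference, a sketch of the kind of EXPLAIN statement Matt asks for above; the table names and the query are hypothetical stand-ins for the actual job:

    -- Run against the job's real INSERT statement in the SQL client.
    -- The printed optimized execution plan lists every operator, which can
    -- then be matched against the Busy / Backpressured columns in the web UI.
    EXPLAIN PLAN FOR
    INSERT INTO jdbc_sink_table               -- hypothetical SQL sink
    SELECT user_id, COUNT(*) AS event_count   -- hypothetical calculation
    FROM kafka_source_table                   -- hypothetical Kafka source
    GROUP BY user_id;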