Re: Help - Question on triaging slowness

Matt Cuento Tue, 24 Jun 2025 00:17:32 -0700

Hi Siva,

Unfortunately, the attached picture does not render for me. Could you send
the output of an `EXPLAIN` statement for your job so we can see the logical
plan? That is a good first step toward understanding what each operator is
doing.
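
In case it helps, here is a small sketch of how the statement could be put
together. I am assuming the job is written in Flink SQL; the helper name and
the table names in the usage note are made up for illustration. The optional
detail keywords are the ones I know from the Flink SQL `EXPLAIN` syntax.

```python
# Detail keywords accepted after EXPLAIN in Flink SQL.
ALLOWED_DETAILS = ("ESTIMATED_COST", "CHANGELOG_MODE", "JSON_EXECUTION_PLAN")


def explain_statement(query, details=()):
    """Wrap a query or INSERT statement in a Flink SQL EXPLAIN statement.

    Hypothetical helper: it only builds the statement text and does not
    talk to a cluster.
    """
    # Drop surrounding whitespace and a trailing semicolon, if any.
    query = query.strip().rstrip(";").rstrip()
    requested = [d.upper() for d in details]
    unknown = [d for d in requested if d not in ALLOWED_DETAILS]
    if unknown:
        raise ValueError(f"unsupported EXPLAIN details: {unknown}")
    # Without details, fall back to the plain "EXPLAIN PLAN FOR" form.
    prefix = ", ".join(requested) if requested else "PLAN FOR"
    return f"EXPLAIN {prefix} {query}"
```

For example, `explain_statement("INSERT INTO sink_a SELECT * FROM events;")`
gives `EXPLAIN PLAN FOR INSERT INTO sink_a SELECT * FROM events`, which you
can run in the SQL client or through the table environment after the tables
have been registered.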


Performance that degrades over time, or after many records have been
processed, sounds like it could be a state size issue. Knowing which
operators are used would help us estimate how much state is being stored.
There are metrics for state size [1] as well as system resources [2].
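
If you would rather pull numbers than screenshots, the same figures the web
UI shows can be read from the JobManager REST API. Below is a rough sketch,
not something I have run against your setup. It assumes the REST endpoint is
on the default `http://localhost:8081`, the job and vertex IDs are
placeholders you would copy from the UI, and the 90% threshold is an
arbitrary choice of mine.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Task time metrics, each reported in milliseconds per second (0 to 1000).
TIME_METRICS = (
    "busyTimeMsPerSecond",
    "backPressuredTimeMsPerSecond",
    "idleTimeMsPerSecond",
)


def metrics_url(base_url, job_id, vertex_id, metrics=TIME_METRICS):
    """Build the URL of the aggregated subtask metrics endpoint for a vertex."""
    names = quote(",".join(metrics), safe=",")
    return (f"{base_url.rstrip('/')}/jobs/{job_id}/vertices/{vertex_id}"
            f"/subtasks/metrics?get={names}")


def fetch(url):
    """Fetch and decode the JSON payload (needs a running cluster)."""
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)


def summarize(payload):
    """Map each metric id to its max across subtasks, as a percentage."""
    # The aggregated endpoint returns entries with id, min, max, avg and sum.
    return {m["id"]: float(m["max"]) / 10.0 for m in payload}


def diagnose(summary, threshold=90.0):
    """Very rough classification; the threshold is an assumption."""
    if summary.get("backPressuredTimeMsPerSecond", 0.0) >= threshold:
        return "backpressured: look at the operators downstream of this one"
    if summary.get("busyTimeMsPerSecond", 0.0) >= threshold:
        return "busy: this operator itself is the likely bottleneck"
    return "ok: no single saturated subtask at this vertex"
```

Usage would be along the lines of
`diagnose(summarize(fetch(metrics_url("http://localhost:8081", job_id, vertex_id))))`.
An operator that is busy but not backpressured is usually where the time is
going, which fits with state access getting slower as the state grows.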

Matt Cuento
cuentom...@gmail.com

[1]
https://nightlies.apache.org/flink/flink-docs-master/docs/ops/metrics/#state-size
[2]
https://nightlies.apache.org/flink/flink-docs-master/docs/ops/metrics/#system-resources


On Mon, Jun 23, 2025 at 4:25 PM Siva Ram Gujju <sivaram2...@gmail.com>
wrote:

> Hello,
> I'm new to Flink and doing a POC. I have a Flink job which reads events
> from a Kafka source topic, performs some calculations, and writes to a
> couple of SQL sinks.
> I deployed this to a standalone cluster running on my Linux virtual
> machine (all default settings).
>
> Parallelism=3
> NoOfTaskSlots allowed in config.yml=10
> NoOfTaskSlots required for my job=3
> Rest of the settings are default.
>
> The job runs fine for the first 100,000 events and the response is near
> real time. After that, the first operator of the job starts to show Busy
> (max): 100% and the processing slows down significantly (see picture below).
> Heap is at 50%.
> Source Lag (kafka consumers lag) is 0. Source Kafka cluster CPU is <3%.
>
>
> 1. How can I triage what is causing slowness? Is it a CPU or Memory issue,
> how do I find it? Everything looks normal to me. No exceptions in logs.
> 2. Why did the job run super fast for the first 100K events and then start
> slowing down? Any theory on this?
> Please suggest. Thank you!
>
>
> [image: Picture 1, Picture]
>
