[ 
https://issues.apache.org/jira/browse/FLINK-34400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17815253#comment-17815253
 ] 

Alexis Sarda-Espinosa edited comment on FLINK-34400 at 2/7/24 10:50 PM:
------------------------------------------------------------------------

Ah, let me clarify. The setup is roughly like this:
 * Source 1 with parallelism 2 consuming from topic A that continuously 
receives messages
 ** I have 2 Task Managers, so each source reader should be consuming 
approximately 15 partitions in each TM.
 * Source 2 with parallelism 1 consuming from topic B that rarely receives 
messages and has stayed mostly empty during my experiments

Both sources were assigned to the same alignment group.

So, what I meant is that Source 1 is showing lag in only _one_ of its readers, 
and the corresponding error logs only show in 1 TM, the other reader and its TM 
never show errors. This is why the graph I posted earlier shows the lag is 
slightly reduced at one step, but then increases a lot more in the next step 
(instead of showing increasing steps all the time).

On the other hand, why would lag start showing up only after 15 minutes or so?

I will probably enable idleness anyway, but I was testing both scenarios and I 
find these inconsistencies kind of unexpected.


was (Author: asardaes):
Ah, let me clarify. The setup is roughly like this:
 * Source 1 with parallelism 2 consuming from topic A that continuously 
receives messages
 ** I have 2 Task Managers, so each source reader should be consuming 
approximately 15 partitions in each TM.
 * Source 2 with parallelism 1 consuming from topic B that rarely receives 
messages and has stayed mostly empty during my experiments

Both sources were assigned to the same alignment group.

So, what I meant is that Source 1 is showing lag in only _one_ of its readers, 
and the corresponding error logs only show in 1 TM, the other reader and its TM 
never show errors.

On the other hand, why would lag start showing up only after 15 minutes or so?

I will probably enable idleness anyway, but I was testing both scenarios and I 
find these inconsistencies kind of unexpected.

> Kafka sources with watermark alignment sporadically stop consuming
> ------------------------------------------------------------------
>
>                 Key: FLINK-34400
>                 URL: https://issues.apache.org/jira/browse/FLINK-34400
>             Project: Flink
>          Issue Type: Bug
>    Affects Versions: 1.18.1
>            Reporter: Alexis Sarda-Espinosa
>            Priority: Major
>         Attachments: alignment_lags.png, logs.txt
>
>
> I have 2 Kafka sources that read from different topics. I have assigned them 
> to the same watermark alignment group, and I have _not_ enabled idleness 
> explicitly in their watermark strategies. One topic remains pretty much empty 
> most of the time, while the other receives a few events per second all the 
> time. Parallelism of the active source is 2, for the other one it's 1, and 
> checkpoints are once every minute.
> This works correctly for some time (10 - 15 minutes in my case) but then 1 of 
> the active sources stops consuming, which causes lag to increase. Weirdly, 
> after another 15 minutes or so, all the backlog is consumed at once, and then 
> everything stops again.
> I'm attaching some logs from the Task Manager where the issue appears. You 
> will notice that the Kafka network client reports disconnections (a long time 
> after the deserializer stopped reporting that events were being consumed), 
> I'm not sure if this is related.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to