[
https://issues.apache.org/jira/browse/FLINK-34400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17815253#comment-17815253
]
Alexis Sarda-Espinosa edited comment on FLINK-34400 at 2/7/24 11:27 AM:
------------------------------------------------------------------------
Ah, let me clarify. The setup is roughly like this:
* Source 1 with parallelism 2 consuming from topic A that continuously
receives messages
** I have 2 Task Managers, so each source reader (one per TM) should be
consuming approximately 15 partitions.
* Source 2 with parallelism 1 consuming from topic B that rarely receives
messages and has stayed mostly empty during my experiments
Both sources were assigned to the same alignment group.
So, what I meant is that Source 1 is showing lag in only _one_ of its readers,
and the corresponding error logs only show up in 1 TM; the other reader and its
TM never show errors.
On the other hand, why would lag start showing up only after 15 minutes or so?
I will probably enable idleness anyway, but I was testing both scenarios and I
find these inconsistencies kind of unexpected.
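
To make the expectation concrete, here is a minimal sketch of the alignment
rule as I understand it from the documentation: a reader is paused once its
watermark is ahead of the group's minimum by more than the allowed drift, and
idle readers are left out of that minimum. This is a simplified model, not
Flink's actual implementation; the class, the reader names, and the numbers
are all hypothetical. Under this model a quiet, non-idle Source 2 would hold
back _both_ readers of Source 1 equally, which is why lag in only one reader
looks inconsistent.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model of a watermark alignment group (hypothetical, not a Flink class). */
final class AlignmentSketch {

    /** State of one source reader: its last watermark and whether it is marked idle. */
    static final class Reader {
        final String name;
        final long watermarkMs;
        final boolean idle;

        Reader(String name, long watermarkMs, boolean idle) {
            this.name = name;
            this.watermarkMs = watermarkMs;
            this.idle = idle;
        }
    }

    /** Minimum watermark over non-idle readers; Long.MAX_VALUE if every reader is idle. */
    static long groupMin(List<Reader> group) {
        long min = Long.MAX_VALUE;
        for (Reader r : group) {
            if (!r.idle) {
                min = Math.min(min, r.watermarkMs);
            }
        }
        return min;
    }

    /** Names of the readers that are ahead of the group minimum by more than maxDriftMs. */
    static List<String> pausedReaders(List<Reader> group, long maxDriftMs) {
        long min = groupMin(group);
        List<String> paused = new ArrayList<>();
        for (Reader r : group) {
            // Idle readers are skipped: they neither hold the group back nor get paused.
            if (!r.idle && r.watermarkMs - min > maxDriftMs) {
                paused.add(r.name);
            }
        }
        return paused;
    }
}
```

For example, with the two Source 1 readers at watermarks of 100000 ms and
101000 ms, the Source 2 reader stuck at 10000 ms, and a drift of 30000 ms,
pausedReaders returns both Source 1 readers. Marking the Source 2 reader as
idle removes it from the minimum and nothing is paused.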
> Kafka sources with watermark alignment sporadically stop consuming
> ------------------------------------------------------------------
>
> Key: FLINK-34400
> URL: https://issues.apache.org/jira/browse/FLINK-34400
> Project: Flink
> Issue Type: Bug
> Affects Versions: 1.18.1
> Reporter: Alexis Sarda-Espinosa
> Priority: Major
> Attachments: alignment_lags.png, logs.txt
>
>
> I have 2 Kafka sources that read from different topics. I have assigned them
> to the same watermark alignment group, and I have _not_ enabled idleness
> explicitly in their watermark strategies. One topic remains pretty much empty
> most of the time, while the other receives a few events per second all the
> time. Parallelism of the active source is 2, for the other one it's 1, and
> checkpoints are once every minute.
> This works correctly for some time (10 - 15 minutes in my case) but then 1 of
> the active sources stops consuming, which causes lag to increase. Weirdly,
> after another 15 minutes or so, all the backlog is consumed at once, and then
> everything stops again.
> I'm attaching some logs from the Task Manager where the issue appears. You
> will notice that the Kafka network client reports disconnections (a long time
> after the deserializer stopped reporting that events were being consumed);
> I'm not sure if this is related.