[
https://issues.apache.org/jira/browse/FLINK-34096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yang LI updated FLINK-34096:
----------------------------
Attachment: global-configuration-log.txt
source-split-log.txt
substate-log.txt
> Upscale of kafka source operator leads to some splits getting lost
> ------------------------------------------------------------------
>
> Key: FLINK-34096
> URL: https://issues.apache.org/jira/browse/FLINK-34096
> Project: Flink
> Issue Type: Bug
> Affects Versions: 1.18.0
> Reporter: Yang LI
> Priority: Critical
> Attachments: global-configuration-log.txt,
> image-2024-01-15-15-46-47-104.png, image-2024-01-15-15-47-36-509.png,
> image-2024-01-15-15-48-07-871.png, source-split-log.txt, substate-log.txt
>
>
> Hello,
> We've been conducting experiments with Autoscaling in Apache Flink version
> 1.18.0 and encountered a bug associated with the Kafka source split.
> The issue manifested in our system as follows: upon experiencing a sudden
> spike in traffic, the autoscaler opted to upscale the Kafka source vertex.
> However, the Kafka source fetcher failed to retrieve all available Kafka
> partitions. Additionally, we observed duplication in source splits. For
> example, taskmanager-1 and taskmanager-4 both fetched the same Kafka
> partition.
> !image-2024-01-15-15-46-47-104.png|width=423,height=381!
> !image-2024-01-15-15-48-07-871.png|width=719,height=329!
> {noformat}
> taskmanager-9 2024-01-09 17:59:37 [Source: kafka_source_input_with_kt -> Flat
> Map -> session_valid (5/18)#6] INFO o.a.f.c.b.s.r.SourceReaderBase - Adding
> split(s) to reader: [[Partition: sf-enriched-4, StartingOffset: 26084169,
> StoppingOffset: -9223372036854775808], [Partition: sf-anonymized-8,
> StartingOffset: 46477069, StoppingOffset: -9223372036854775808], [Partition:
> sf-anonymized-9, StartingOffset: 46121324, StoppingOffset:
> -9223372036854775808], [Partition: sf-enriched-5, StartingOffset: 26221751,
> StoppingOffset: -9223372036854775808]]
> taskmanager-6 2024-01-09 17:59:37 [Source: kafka_source_input_with_kt -> Flat
> Map -> session_valid (4/18)#6] INFO o.a.f.c.b.s.r.SourceReaderBase - Adding
> split(s) to reader: [[Partition: sf-enriched-4, StartingOffset: 26084169,
> StoppingOffset: -9223372036854775808], [Partition: sf-anonymized-8,
> StartingOffset: 46477069, StoppingOffset: -9223372036854775808], [Partition:
> sf-anonymized-20, StartingOffset: 46211745, StoppingOffset:
> -9223372036854775808], [Partition: sf-anonymized-32, StartingOffset:
> 46340878, StoppingOffset: -9223372036854775808]] {noformat}
> You can see in these logs, taskmanager-9 and taskmanager-6 has both fetched
> partition sf-enriched-4 and sf-anonymized-8
> Additional Questions
> * During some other experiments which also lead to kafka partition issues,
> we noticed that the autoscaler attempted to increase the parallelism of the
> source vertex to a value that is not a divisor of the Kafka topic's partition
> count. For example, it recommended a parallelism of 48 when the total
> partition count was 72. In such scenarios:
> *
> ** Does kafka source connector vertex still suppose to works well when its
> parallelism is not divisor of topic's partition count?
> ** If this configuration is not ideal, should there be a mechanism within
> the autoscaler to ensure that the parallelism of the source vertex always
> matches the topic's partition count?
>
>
>
>
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)