[ https://issues.apache.org/jira/browse/FLINK-34096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17807185#comment-17807185 ]

Yang LI commented on FLINK-34096:
---------------------------------

Hello [~dmvk] [~isburmistrov], I have uploaded the following logs:

1. logs of source splits: [^source-split-log.txt]
2. logs of remote SubtaskState: [^substate-log.txt]
3. global configuration: [^global-configuration-log.txt]

Env conf:
Flink 1.18 + autoscaler
source vertex parallelism: 12
downstream vertex parallelism: 18
Kafka partitions: 72

Actions to reproduce this scenario:
    1. scaled the source vertex parallelism from 12 to 48 manually via the 
scale button in the Flink UI
    2. the job scaled from 5 taskmanagers to 8 taskmanagers


You can see that the split for partition sf-anonymized-13 is duplicated 
across taskmanagers 9, 11, 10, and 6.
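
For context on why this looks wrong: as far as I understand, the Kafka 
enumerator deterministically maps each partition to exactly one reader. A 
minimal sketch of that owner computation (paraphrased from my reading of 
KafkaSourceEnumerator in flink-connector-kafka; the exact hashing may differ 
between versions):

{code:java}
public class SplitOwnerSketch {

    // Simplified sketch: a topic-dependent start index plus round-robin over
    // partition ids, so each partition maps to exactly one reader index.
    static int getSplitOwner(String topic, int partition, int numReaders) {
        int startIndex = ((topic.hashCode() * 31) & 0x7FFFFFFF) % numReaders;
        return (startIndex + partition) % numReaders;
    }

    public static void main(String[] args) {
        int numReaders = 48; // parallelism after the manual upscale
        for (int p = 0; p < 72; p++) {
            System.out.printf("sf-anonymized-%d -> reader %d%n",
                    p, getSplitOwner("sf-anonymized", p, numReaders));
        }
    }
}
{code}

Under such a mapping, a split like sf-anonymized-13 should only ever be owned 
by one subtask at a given parallelism, so seeing it on four taskmanagers 
suggests the split state was both redistributed and re-added during the 
rescale.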

> Upscale of kafka source operator leads to some splits getting lost
> ------------------------------------------------------------------
>
>                 Key: FLINK-34096
>                 URL: https://issues.apache.org/jira/browse/FLINK-34096
>             Project: Flink
>          Issue Type: Bug
>    Affects Versions: 1.18.0
>            Reporter: Yang LI
>            Priority: Critical
>         Attachments: global-configuration-log.txt, 
> image-2024-01-15-15-46-47-104.png, image-2024-01-15-15-47-36-509.png, 
> image-2024-01-15-15-48-07-871.png, source-split-log.txt, substate-log.txt
>
>
> Hello,
> We've been conducting experiments with Autoscaling in Apache Flink version 
> 1.18.0 and encountered a bug associated with Kafka source splits.
> The issue manifested in our system as follows: upon experiencing a sudden 
> spike in traffic, the autoscaler opted to upscale the Kafka source vertex. 
> However, the Kafka source fetcher failed to retrieve all available Kafka 
> partitions. Additionally, we observed duplication in source splits. For 
> example, taskmanager-1 and taskmanager-4 both fetched the same Kafka 
> partition.
> !image-2024-01-15-15-46-47-104.png|width=423,height=381!
> !image-2024-01-15-15-48-07-871.png|width=719,height=329!
> {noformat}
> taskmanager-9 2024-01-09 17:59:37 [Source: kafka_source_input_with_kt -> Flat Map -> session_valid (5/18)#6] INFO  o.a.f.c.b.s.r.SourceReaderBase - Adding split(s) to reader: [[Partition: sf-enriched-4, StartingOffset: 26084169, StoppingOffset: -9223372036854775808], [Partition: sf-anonymized-8, StartingOffset: 46477069, StoppingOffset: -9223372036854775808], [Partition: sf-anonymized-9, StartingOffset: 46121324, StoppingOffset: -9223372036854775808], [Partition: sf-enriched-5, StartingOffset: 26221751, StoppingOffset: -9223372036854775808]]
> taskmanager-6 2024-01-09 17:59:37 [Source: kafka_source_input_with_kt -> Flat Map -> session_valid (4/18)#6] INFO  o.a.f.c.b.s.r.SourceReaderBase - Adding split(s) to reader: [[Partition: sf-enriched-4, StartingOffset: 26084169, StoppingOffset: -9223372036854775808], [Partition: sf-anonymized-8, StartingOffset: 46477069, StoppingOffset: -9223372036854775808], [Partition: sf-anonymized-20, StartingOffset: 46211745, StoppingOffset: -9223372036854775808], [Partition: sf-anonymized-32, StartingOffset: 46340878, StoppingOffset: -9223372036854775808]]
> {noformat}
> You can see in these logs that taskmanager-9 and taskmanager-6 have both 
> fetched partitions sf-enriched-4 and sf-anonymized-8.
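>
> To make such duplicates easier to spot across the full logs, a quick 
> cross-check could look like the following (a hypothetical helper, not part 
> of Flink; it only assumes each "Adding split(s) to reader" event sits on a 
> single line prefixed with the taskmanager name, as in the excerpt above):
> {code:java}
> import java.nio.file.Files;
> import java.nio.file.Path;
> import java.util.Map;
> import java.util.Set;
> import java.util.TreeMap;
> import java.util.TreeSet;
> import java.util.regex.Matcher;
> import java.util.regex.Pattern;
>
> public class DuplicateSplitCheck {
>
>     public static void main(String[] args) throws Exception {
>         // Map each partition name to the set of taskmanagers that logged it.
>         Map<String, Set<String>> owners = new TreeMap<>();
>         Pattern partition = Pattern.compile("Partition: ([\\w.-]+-\\d+)");
>         for (String line : Files.readAllLines(Path.of(args[0]))) {
>             if (!line.contains("Adding split(s) to reader")) continue;
>             String tm = line.split("\\s+")[0]; // e.g. "taskmanager-9"
>             Matcher m = partition.matcher(line);
>             while (m.find()) {
>                 owners.computeIfAbsent(m.group(1), k -> new TreeSet<>()).add(tm);
>             }
>         }
>         // A partition logged by more than one taskmanager is a duplicated split.
>         owners.forEach((p, tms) -> {
>             if (tms.size() > 1) {
>                 System.out.println(p + " duplicated on " + tms);
>             }
>         });
>     }
> }
> {code}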
> Additional Questions
>  * During some other experiments, which also led to Kafka partition issues, 
> we noticed that the autoscaler attempted to increase the parallelism of the 
> source vertex to a value that is not a divisor of the Kafka topic's partition 
> count. For example, it recommended a parallelism of 48 when the total 
> partition count was 72. In such scenarios (a small worked example follows 
> below):
>  ** Is the Kafka source connector vertex still supposed to work well when its 
> parallelism is not a divisor of the topic's partition count?
>  ** If this configuration is not ideal, should there be a mechanism within 
> the autoscaler to ensure that the parallelism of the source vertex always 
> evenly divides the topic's partition count?
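>
> For what it's worth, under a round-robin style owner computation like the 
> simplified sketch earlier in this thread, a non-divisor parallelism should 
> still assign every partition, just unevenly: 72 partitions over 48 readers 
> gives 24 readers two partitions and 24 readers one. A minimal sketch of that 
> count, under the same assumptions:
> {code:java}
> import java.util.HashMap;
> import java.util.Map;
>
> public class AssignmentSkewSketch {
>
>     // Same simplified owner computation as in the earlier sketch.
>     static int getSplitOwner(String topic, int partition, int numReaders) {
>         int startIndex = ((topic.hashCode() * 31) & 0x7FFFFFFF) % numReaders;
>         return (startIndex + partition) % numReaders;
>     }
>
>     public static void main(String[] args) {
>         int numReaders = 48;
>         int numPartitions = 72;
>         Map<Integer, Integer> perReader = new HashMap<>();
>         for (int p = 0; p < numPartitions; p++) {
>             perReader.merge(getSplitOwner("topic", p, numReaders), 1, Integer::sum);
>         }
>         // Expected: every reader owns one or two partitions; none owns zero
>         // and no partition appears under two readers.
>         System.out.println(perReader);
>     }
> }
> {code}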



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
