Hi Hongshun,

Thanks for updating the FLIP. It looks much cleaner. Some minor comments:

1. Rename the `splitsOnRecovery` in ReaderInfo to
`reportedSplitsOnRegistration`.
2. The SourceOperator should always send the ReaderRegistrationEvent with
the `reportedSplitsOnRegistration` list. But it should not add the splits to
the reader if `SupportSplitReassignmentOnRecovery` is implemented.
3. "*Thus, If a connector want to use this FLIP-537, enumerator must keep a
unAssignedSplit in state*" - this is up to the enumerator implementation to
decide. It is not a must-have for all enumerator implementations.
4. The RoundRobin algorithm in KafkaSource is not deterministic. Why not
just get all the splits and do a round-robin based on the numReaders? E.g.
sort all the splits, then assign them to readers 0, 1, 2, 3...N, 0, 1,
2, 3...N...
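
For example, a deterministic round-robin could look like the sketch
below (just an illustration; roundRobin is a made-up helper, not
existing KafkaSource code):

    import java.util.*;
    import org.apache.flink.api.connector.source.SourceSplit;

    // Sort first so the result depends only on (splits, numReaders),
    // not on the discovery/recovery order of the splits.
    static <T extends SourceSplit> Map<Integer, List<T>> roundRobin(
            Collection<T> splits, int numReaders) {
        List<T> sorted = new ArrayList<>(splits);
        sorted.sort(Comparator.comparing(SourceSplit::splitId));
        Map<Integer, List<T>> assignment = new HashMap<>();
        for (int i = 0; i < sorted.size(); i++) {
            assignment
                    .computeIfAbsent(i % numReaders, r -> new ArrayList<>())
                    .add(sorted.get(i));
        }
        return assignment;
    }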

Thanks,

Jiangjie (Becket) Qin

On Tue, Oct 14, 2025 at 7:28 PM Hongshun Wang <[email protected]>
wrote:

> Hi devs,
>
> If there is no other problem, I will start a vote later.
>
> Best,
> Hongshun
>
> On Mon, Oct 13, 2025 at 4:17 PM Hongshun Wang <[email protected]>
> wrote:
>
>> Hi Becket and Leonard,
>>
>> It seems adding `splitsOnRecovery` to `ReaderInfo` makes the split
>> enumerator simpler and cleaner.
>>
>> I have modified this FLIP again. Please have a look and let me know what
>> you think.
>>
>> Best,
>> Hongshun
>>
>> On Mon, Oct 13, 2025 at 10:48 AM Hongshun Wang <[email protected]>
>> wrote:
>>
>>> Hi Becket,
>>> Thanks for your explanation.
>>>
>>> > For the same three inputs above, the assignment should be consistently
>>> the same.
>>>
>>> That is exactly what troubles me. For assignment algorithms such as
>>> hash, it does behave the same. But what if we use round-robin? With the
>>> same reader information, the same split may be assigned to different
>>> readers. This is also what I listed as an example before:
>>>
>>>    1. *Initial state:* parallelism 2, 2 splits.
>>>    2. *Enumerator action:* Split 1 → Task 1, Split 2 → Task 2.
>>>    3. *Failure scenario:* After Split 2 is assigned to Task 2 but
>>>    before the next checkpoint succeeds, Task 2 restarts.
>>>    4. *Recovery issue:* Split 2 is re-added to the enumerator. The
>>>    round-robin strategy assigns Split 2 to Task 1. Task 1 now has 2
>>>    splits, Task 2 has 0 → imbalanced distribution.
>>>
>>>
>>> > Please let me know if you think a meeting would be more efficient.
>>> Yes, I'd like to reach an agreement as soon as possible. If you're
>>> available, we could schedule a meeting with Leonard as well.
>>>
>>> Best,
>>> Hongshun
>>>
>>> On Sat, Oct 11, 2025 at 3:59 PM Becket Qin <[email protected]> wrote:
>>>
>>>> Hi Hongshun,
>>>>
>>>> I am confused. First of all, regardless of what the assignment
>>>> algorithm is, using SplitEnumeratorContext to return the splits only gives
>>>> more information than using addSplitsBack(). So there should be no
>>>> regression.
>>>>
>>>> Secondly, at this point, the SplitEnumerator should only take the
>>>> following three inputs to generate the global split assignment:
>>>> 1. the *reader information (num readers, locations, etc)*
>>>> 2. *all the splits to assign*
>>>> 3. *the configured assignment algorithm*
>>>> Preferably, for the same three inputs above, the assignment should be
>>>> consistently the same. I don't see why it should care about why a new
>>>> reader is added, whether due to partial failover, global failover, or job
>>>> restart.
>>>>
>>>> If you want to do global redistribution on global failover and restart,
>>>> but honor the existing assignment for partial failover, the enumerator can
>>>> just do the following:
>>>> 1. Generate a new global assignment (global redistribution) in start(),
>>>> because start() will only be invoked on global failover or restart. That
>>>> means all the readers are also new, with empty assignments.
>>>> 2. After the global assignment is generated, it should be honored for
>>>> the whole life cycle. There might be many reader registrations, again for
>>>> different reasons, but that does not matter:
>>>>     - reader registration after a job restart
>>>>     - reader registration after a global failover
>>>>     - reader registration due to partial failover, which may or may not
>>>> come with an addSplitsBack() call.
>>>>     Regardless of the reason, the split enumerator will just enforce
>>>> the global assignment it has already generated, i.e. without split
>>>> redistribution.
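>>>>
>>>> As a rough sketch in Java (the field globalAssignment and the helpers
>>>> partition() / allSplits() are illustrative, not part of any proposed
>>>> API):
>>>>
>>>>     // Illustrative field: subtask -> splits, computed once in start().
>>>>     private Map<Integer, List<SplitT>> globalAssignment;
>>>>
>>>>     @Override
>>>>     public void start() {
>>>>         // start() only runs on job restart / global failover, so the
>>>>         // global assignment is generated exactly once here.
>>>>         globalAssignment =
>>>>                 partition(allSplits(), context.currentParallelism());
>>>>     }
>>>>
>>>>     @Override
>>>>     public void addReader(int subtaskId) {
>>>>         // Enforce the already-generated global assignment, whatever the
>>>>         // reason for this registration (restart, global failover, or
>>>>         // partial failover) - no redistribution.
>>>>         context.assignSplits(new SplitsAssignment<>(
>>>>                 Map.of(subtaskId, globalAssignment.get(subtaskId))));
>>>>     }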
>>>>
>>>> Wouldn't that give the behavior you want? I feel the discussion somehow
>>>> goes in circles. Please let me know if you think a meeting would be more
>>>> efficient.
>>>>
>>>> Thanks,
>>>>
>>>> Jiangjie (Becket) Qin
>>>>
>>>> On Fri, Oct 10, 2025 at 7:58 PM Hongshun Wang <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Becket,
>>>>>
>>>>> > Ignore a returned split if it has been assigned to a different
>>>>> reader, otherwise put it back to unassigned splits / pending splits. Then
>>>>> the enumerator assigns new splits to the newly added reader, which may use
>>>>> the previous assignment as a reference. This should work regardless of
>>>>> whether it is a global failover, partial failover, restart, etc. There is
>>>>> no need for the SplitEnumerator to distinguish what failover scenario it 
>>>>> is.
>>>>>
>>>>> In this case, it seems that global failover and partial failover share
>>>>> the same distribution strategy if the split has not been assigned to a
>>>>> different reader. However, splits need to be redistributed on global
>>>>> failover (this is why we need this FLIP), while on partial failover they
>>>>> do not. I have no idea how we can distinguish them.
>>>>>
>>>>> What do you think?
>>>>>
>>>>> Best,
>>>>> Hongshun
>>>>>
>>>>> On Sat, Oct 11, 2025 at 12:54 AM Becket Qin <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi Hongshun,
>>>>>>
>>>>>> The problem we are trying to solve here is to give the splits back to
>>>>>> the SplitEnumerator. There are only two types of splits to give back:
>>>>>> 1) splits whose assignment has been checkpointed - in this case, we
>>>>>> rely on addReader() + SplitEnumeratorContext to give the splits back,
>>>>>> which provides more information associated with those splits.
>>>>>> 2) splits whose assignment has not been checkpointed - in this case,
>>>>>> we use addSplitsBack(); there is no reader info to give because the
>>>>>> previous assignment did not take effect to begin with.
>>>>>>
>>>>>> From the SplitEnumerator implementation perspective, the contract is
>>>>>> straightforward:
>>>>>> 1. The SplitEnumerator is the source of truth for assignment.
>>>>>> 2. When the enumerator receives the addSplitsBack() call, it always
>>>>>> adds these splits back to unassigned splits / pending splits.
>>>>>> 3. When the enumerator receives the addReader() call, that means the
>>>>>> reader has no current assignment, and has returned its previous
>>>>>> assignment based on the reader-side info. The SplitEnumerator checks the
>>>>>> SplitEnumeratorContext to retrieve the returned splits from that reader
>>>>>> (i.e. the previous assignment) and handles them according to its own
>>>>>> source-of-truth knowledge of the assignment: ignore a returned split if
>>>>>> it has been assigned to a different reader, otherwise put it back to
>>>>>> unassigned splits / pending splits. Then the enumerator assigns new
>>>>>> splits to the newly added reader, which may use the previous assignment
>>>>>> as a reference. This should work regardless of whether it is a global
>>>>>> failover, partial failover, restart, etc. There is no need for the
>>>>>> SplitEnumerator to distinguish what failover scenario it is.
>>>>>>
>>>>>> Would this work?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Jiangjie (Becket) Qin
>>>>>>
>>>>>> On Fri, Oct 10, 2025 at 1:28 AM Hongshun Wang <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Hi Becket,
>>>>>>>  > why do we need to change the behavior of addSplitsBack()? Should
>>>>>>> it remain the same?
>>>>>>>
>>>>>>> How does the enumerator get the splits from the ReaderRegistrationEvent
>>>>>>> and then reassign them?
>>>>>>>
>>>>>>> You gave this advice before:
>>>>>>> > 1. Put all the reader information in the SplitEnumerator context.
>>>>>>> 2. notify the enumerator about the new reader registration. 3. let the
>>>>>>> split enumerator get whatever information it wants from the context and 
>>>>>>> do
>>>>>>> its job.
>>>>>>>
>>>>>>> However, each time a source task fails over, the
>>>>>>> `ConcurrentMap<Integer, ConcurrentMap<Integer, ReaderInfo>>
>>>>>>> registeredReaders` will remove that reader's info, and when the source
>>>>>>> task registers again, it will be added again. *Thus, registeredReaders
>>>>>>> cannot know whether the reader was registered before.*
>>>>>>>
>>>>>>> Therefore, enumerator#addReader cannot distinguish the following three
>>>>>>> situations:
>>>>>>> 1. The reader is registered during a global restart. In this case,
>>>>>>> redistribute the splits from the reader infos (take all the splits off
>>>>>>> ReaderInfo).
>>>>>>> 2. The reader is registered during a partial failover (before the first
>>>>>>> successful checkpoint). In this case, ignore the splits from the infos
>>>>>>> (leave all the splits in ReaderInfo alone).
>>>>>>> 3. The reader is registered during a partial failover (after the first
>>>>>>> successful checkpoint). In this case, we need to assign the splits to
>>>>>>> the same reader again (take the splits off ReaderInfo but assign them
>>>>>>> back to it).
>>>>>>> We still need the enumerator to distinguish them (using
>>>>>>> pendingSplitAssignments & assignedSplitAssignments). However, it is
>>>>>>> redundant to maintain split assignment information both in the
>>>>>>> enumerator and in the enumerator context.
>>>>>>>
>>>>>>> I think if we change the behavior of addSplitsBack, it will be simpler:
>>>>>>> just let the enumerator handle these splits based on
>>>>>>> pendingSplitAssignments & assignedSplitAssignments.
>>>>>>>
>>>>>>> What do you think?
>>>>>>>
>>>>>>> Best,
>>>>>>> Hongshun
>>>>>>>
>>>>>>> On Fri, Oct 10, 2025 at 12:55 PM Becket Qin <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Hongshun,
>>>>>>>>
>>>>>>>> Thanks for updating the FLIP. A quick question: why do we need to
>>>>>>>> change the behavior of addSplitsBack()? Should it remain the same?
>>>>>>>>
>>>>>>>> Regarding the case of restart with a changed subscription: I think
>>>>>>>> the only correct behavior is removing obsolete splits without any
>>>>>>>> warning / exception. It is OK to add info-level logging if we want to.
>>>>>>>> The intention is clear if the user has explicitly changed the
>>>>>>>> subscription and restarted the job. There is no need to add a config to
>>>>>>>> double-confirm.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>
>>>>>>>> On Thu, Oct 9, 2025 at 7:28 PM Hongshun Wang <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Leonard,
>>>>>>>>>
>>>>>>>>> If the SplitEnumerator receives all splits after a restart, it
>>>>>>>>> becomes straightforward to clear and un-assign the unmatched splits
>>>>>>>>> (checking whether each split matches the source options). However, a
>>>>>>>>> key question arises: *should we automatically discard obsolete splits,
>>>>>>>>> or explicitly notify the user via an exception?*
>>>>>>>>>
>>>>>>>>> We provide an option `scan.partition-unsubscribe.strategy`:
>>>>>>>>> 1. If Strict, throw an exception when encountering removed splits.
>>>>>>>>> 2. If Lenient, automatically remove obsolete splits silently.
>>>>>>>>>
>>>>>>>>> What do you think?
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Hongshun
>>>>>>>>>
>>>>>>>>> On Thu, Oct 9, 2025 at 9:37 PM Leonard Xu <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks Hongshun for the update and the pretty detailed analysis of
>>>>>>>>>> edge cases; the updated FLIP looks good to me now.
>>>>>>>>>>
>>>>>>>>>> Only one last implementation detail about the scenario in the
>>>>>>>>>> motivation section:
>>>>>>>>>>
>>>>>>>>>> *Restart with changed subscription: During restart, if the source
>>>>>>>>>> options remove a topic or table, the splits which have already been
>>>>>>>>>> assigned cannot be removed.*
>>>>>>>>>>
>>>>>>>>>> Could you clarify how we resolve this in Kafka connector ?
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Leonard
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Oct 9, 2025 at 19:48, Hongshun Wang <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi devs,
>>>>>>>>>> If there are no further suggestions, I will start the voting
>>>>>>>>>> tomorrow.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Hongshun
>>>>>>>>>>
>>>>>>>>>> On Fri, Sep 26, 2025 at 7:48 PM Hongshun Wang <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Becket and Leonard,
>>>>>>>>>>>
>>>>>>>>>>> I have updated the content of this FLIP. The key point is that:
>>>>>>>>>>>
>>>>>>>>>>> When the split enumerator receives a split, *the split must
>>>>>>>>>>> already exist in pendingSplitAssignments or
>>>>>>>>>>> assignedSplitAssignments*:
>>>>>>>>>>>
>>>>>>>>>>>    - If the split is in pendingSplitAssignments, ignore it.
>>>>>>>>>>>    - If the split is in assignedSplitAssignments but has a
>>>>>>>>>>>    different taskId, ignore it (this indicates it was already
>>>>>>>>>>>    assigned to another task).
>>>>>>>>>>>    - If the split is in assignedSplitAssignments and shares the
>>>>>>>>>>>    same taskId, move the assignment from assignedSplitAssignments
>>>>>>>>>>>    to pendingSplitAssignments to re-assign it again (see the
>>>>>>>>>>>    sketch below).
>>>>>>>>>>>
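>>>>>>>>>>> A minimal sketch of that handling (the map names follow the FLIP;
>>>>>>>>>>> the method name is made up for illustration):
>>>>>>>>>>>
>>>>>>>>>>>     // pendingSplitAssignments: splitId -> split (to be assigned)
>>>>>>>>>>>     // assignedSplitAssignments: splitId -> taskId (already assigned)
>>>>>>>>>>>     void onSplitReceived(SplitT split, int taskId) {
>>>>>>>>>>>         String id = split.splitId();
>>>>>>>>>>>         if (pendingSplitAssignments.containsKey(id)) {
>>>>>>>>>>>             return; // case 1: already pending, ignore
>>>>>>>>>>>         }
>>>>>>>>>>>         Integer owner = assignedSplitAssignments.get(id);
>>>>>>>>>>>         if (owner == null || owner != taskId) {
>>>>>>>>>>>             return; // case 2: assigned to another task, ignore
>>>>>>>>>>>         }
>>>>>>>>>>>         // case 3: same taskId, move it back for re-assignment
>>>>>>>>>>>         assignedSplitAssignments.remove(id);
>>>>>>>>>>>         pendingSplitAssignments.put(id, split);
>>>>>>>>>>>     }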
>>>>>>>>>>>
>>>>>>>>>>> For a better understanding of why these strategies are used, I
>>>>>>>>>>> added some examples and pictures to illustrate them.
>>>>>>>>>>>
>>>>>>>>>>> Would you like to help me check whether there are still some
>>>>>>>>>>> problems?
>>>>>>>>>>>
>>>>>>>>>>> Best
>>>>>>>>>>> Hongshun
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Sep 26, 2025 at 5:08 PM Leonard Xu <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks Becket and Hongshun for the insightful discussion.
>>>>>>>>>>>>
>>>>>>>>>>>> The underlying implementation and communication mechanisms of
>>>>>>>>>>>> Flink Source indeed involve many intricate details. We discussed the
>>>>>>>>>>>> issue of split re-assignment in specific scenarios, but fortunately
>>>>>>>>>>>> the final decision turned out to be pretty clear.
>>>>>>>>>>>>
>>>>>>>>>>>> +1 to Becket's proposal, which keeps the framework cleaner and
>>>>>>>>>>>> more flexible.
>>>>>>>>>>>> +1 to Hongshun’s point to provide comprehensive guidance for
>>>>>>>>>>>> connector developers.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Leonard
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sep 26, 2025 at 16:30, Hongshun Wang <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Becket,
>>>>>>>>>>>>
>>>>>>>>>>>> I got it. You're suggesting we should not handle this in the
>>>>>>>>>>>> source framework but instead let the split enumerator manage these
>>>>>>>>>>>> three scenarios.
>>>>>>>>>>>>
>>>>>>>>>>>> Let me explain why I originally favored handling it in the
>>>>>>>>>>>> framework: I'm concerned that connector developers might overlook
>>>>>>>>>>>> certain edge cases (after all, even we needed extensive discussions
>>>>>>>>>>>> to fully clarify the logic).
>>>>>>>>>>>>
>>>>>>>>>>>> However, your point keeps the framework cleaner and more
>>>>>>>>>>>> flexible. Thus, I will take it.
>>>>>>>>>>>>
>>>>>>>>>>>> Perhaps, in this FLIP, we should focus on providing
>>>>>>>>>>>> comprehensive guidance for connector developers: explain how
>>>>>>>>>>>> to implement a split enumerator, including the underlying 
>>>>>>>>>>>> challenges and
>>>>>>>>>>>> their solutions.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Additionally, we can use the Kafka connector as a reference
>>>>>>>>>>>> implementation to demonstrate the practical steps. This way, 
>>>>>>>>>>>> developers who
>>>>>>>>>>>> want to implement similar connectors can directly reference this 
>>>>>>>>>>>> example.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Hongshun
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Sep 26, 2025 at 1:27 PM Becket Qin <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> It would be good to not expose runtime details to the source
>>>>>>>>>>>>> implementation if possible.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Today, the split enumerator implementations are expected to
>>>>>>>>>>>>> track the split assignment.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Assuming the split enumerator implementation keeps a split
>>>>>>>>>>>>> assignment map, that means the enumerator should already know 
>>>>>>>>>>>>> whether a
>>>>>>>>>>>>> split is assigned or unassigned. So it can handle the three 
>>>>>>>>>>>>> scenarios you
>>>>>>>>>>>>> mentioned.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The split is reported by a reader during a global restoration.
>>>>>>>>>>>>>>
>>>>>>>>>>>>> The split enumerator should have just been restored / created.
>>>>>>>>>>>>> If the enumerator expects a full reassignment of splits upon global
>>>>>>>>>>>>> recovery, there should be no assigned splits for that reader in the
>>>>>>>>>>>>> split assignment mapping.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The split is reported by a reader during a partial failure
>>>>>>>>>>>>>> recovery.
>>>>>>>>>>>>>>
>>>>>>>>>>>>> In this case, when SplitEnumerator.addReader() is invoked, the
>>>>>>>>>>>>> split assignment map in the enumerator implementation should 
>>>>>>>>>>>>> already have
>>>>>>>>>>>>> some split assignments for the reader. Therefore it is a partial 
>>>>>>>>>>>>> failover.
>>>>>>>>>>>>> If the source supports split reassignment on recovery, the 
>>>>>>>>>>>>> enumerator can
>>>>>>>>>>>>> assign splits that are different from the reported assignment of 
>>>>>>>>>>>>> that
>>>>>>>>>>>>> reader in the SplitEnumeratorContext, or it can also assign the 
>>>>>>>>>>>>> same
>>>>>>>>>>>>> splits. In any case, the enumerator knows that this is a partial 
>>>>>>>>>>>>> recovery
>>>>>>>>>>>>> because the assignment map is non-empty.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The split is not reported by a reader, but is assigned after
>>>>>>>>>>>>>> the last successful checkpoint and was never acknowledged.
>>>>>>>>>>>>>
>>>>>>>>>>>>> This is actually one of the steps in the partial failure
>>>>>>>>>>>>> recovery. SplitEnumerator.addSplitsBack() will be called before
>>>>>>>>>>>>> SplitEnumerator.addReader() is called for the recovered reader. When
>>>>>>>>>>>>> SplitEnumerator.addSplitsBack() is invoked, it is for sure a partial
>>>>>>>>>>>>> recovery, and the enumerator should remove these splits from the
>>>>>>>>>>>>> split assignment map as if they were never assigned.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think this should work, right?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Sep 25, 2025 at 8:34 PM Hongshun Wang <
>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Becket and Leonard,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for your advice.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> > put all the reader information in the SplitEnumerator
>>>>>>>>>>>>>> context
>>>>>>>>>>>>>> I have a concern: the current registeredReaders in
>>>>>>>>>>>>>> *SourceCoordinatorContext will be removed when subtaskReset is
>>>>>>>>>>>>>> executed on failure*. However, this approach has merit.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> One more situation I found my previous design does not cover:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    1. Initial state: Reader A reports splits (1, 2).
>>>>>>>>>>>>>>    2. Enumerator action: Assigns split 1 to Reader A, and
>>>>>>>>>>>>>>    split 2 to Reader B.
>>>>>>>>>>>>>>    3. Failure scenario: Reader A fails before checkpointing.
>>>>>>>>>>>>>>    Since this is a partial failure, only Reader A restarts.
>>>>>>>>>>>>>>    4. Recovery issue: Upon recovery, Reader A re-reports
>>>>>>>>>>>>>>    split (1).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In my previous design, the enumerator would ignore Reader A's
>>>>>>>>>>>>>> re-registration, which would cause data loss.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thus, when the enumerator receives a split, the split may
>>>>>>>>>>>>>> originate from three scenarios:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    1. The split is reported by a reader during a global
>>>>>>>>>>>>>>    restoration.
>>>>>>>>>>>>>>    2. The split is reported by a reader during a partial
>>>>>>>>>>>>>>    failure recovery.
>>>>>>>>>>>>>>    3. The split is not reported by a reader, but is assigned
>>>>>>>>>>>>>>    after the last successful checkpoint and was never 
>>>>>>>>>>>>>> acknowledged.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In the first scenario (global restore), the split should
>>>>>>>>>>>>>> be re-distributed. For the latter two scenarios (partial 
>>>>>>>>>>>>>> failover and
>>>>>>>>>>>>>> post-checkpoint assignment), we need to reassign the split to
>>>>>>>>>>>>>> its originally assigned subtask.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> By implementing a method in the SplitEnumerator context to
>>>>>>>>>>>>>> track each assigned split's status, the system can correctly
>>>>>>>>>>>>>> identify and resolve split ownership in all three scenarios.
>>>>>>>>>>>>>> *What about adding a `SplitRecoveryType splitRecoveryType(Split
>>>>>>>>>>>>>> split)` method in SplitEnumeratorContext?* SplitRecoveryType is
>>>>>>>>>>>>>> an enum including `UNASSIGNED`, `GLOBAL_RESTORE`,
>>>>>>>>>>>>>> `PARTIAL_FAILOVER` and `POST_CHECKPOINT_ASSIGNMENT`.
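>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As a sketch (this is the proposed addition, not an existing
>>>>>>>>>>>>>> Flink API):
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>     // Proposed: describes how a split came back to the enumerator.
>>>>>>>>>>>>>>     public enum SplitRecoveryType {
>>>>>>>>>>>>>>         UNASSIGNED,
>>>>>>>>>>>>>>         GLOBAL_RESTORE,
>>>>>>>>>>>>>>         PARTIAL_FAILOVER,
>>>>>>>>>>>>>>         POST_CHECKPOINT_ASSIGNMENT
>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>     // Proposed new method on SplitEnumeratorContext<SplitT>:
>>>>>>>>>>>>>>     SplitRecoveryType splitRecoveryType(SplitT split);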
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> What do you think? Are there any details or scenarios I
>>>>>>>>>>>>>> haven't considered? Looking forward to your advice.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>> Hongshun
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Sep 11, 2025 at 12:41 AM Becket Qin <
>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks for the explanation, Hongshun.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The current pattern of handling new reader registration is the
>>>>>>>>>>>>>>> following:
>>>>>>>>>>>>>>> 1. put all the reader information in the SplitEnumerator
>>>>>>>>>>>>>>> context
>>>>>>>>>>>>>>> 2. notify the enumerator about the new reader registration.
>>>>>>>>>>>>>>> 3. Let the split enumerator get whatever information it
>>>>>>>>>>>>>>> wants from the
>>>>>>>>>>>>>>> context and do its job.
>>>>>>>>>>>>>>> This pattern decouples the information passing and the
>>>>>>>>>>>>>>> reader registration
>>>>>>>>>>>>>>> notification. This makes the API extensible - we can add
>>>>>>>>>>>>>>> more information
>>>>>>>>>>>>>>> (e.g. reported assigned splits in our case) about the reader
>>>>>>>>>>>>>>> to the context
>>>>>>>>>>>>>>> without introducing new methods.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Introducing a new method like addSplitsBackOnRecovery() is
>>>>>>>>>>>>>>> redundant given the
>>>>>>>>>>>>>>> above pattern. Do we really need it?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Sep 8, 2025 at 8:18 PM Hongshun Wang <
>>>>>>>>>>>>>>> [email protected]>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> > Hi Becket,
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > > I am curious what would the enumerator do differently
>>>>>>>>>>>>>>> for the splits
>>>>>>>>>>>>>>> > added via addSplitsBackOnRecovery() V.S. addSplitsBack()?
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > In this FLIP, there are two distinct scenarios in which
>>>>>>>>>>>>>>> > the enumerator receives splits being added back:
>>>>>>>>>>>>>>> > 1. Job-level restore: the job is restored, and the splits from
>>>>>>>>>>>>>>> > the reader's state are reported by ReaderRegistrationEvent.
>>>>>>>>>>>>>>> > 2. Reader-level restart: a reader is restarted but not the
>>>>>>>>>>>>>>> > whole job, and the splits assigned to it after the last
>>>>>>>>>>>>>>> > successful checkpoint are added back. This is what
>>>>>>>>>>>>>>> > addSplitsBack used to do.
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > In these two situations, the enumerator will choose
>>>>>>>>>>>>>>> different strategies.
>>>>>>>>>>>>>>> > 1. Job-level restore: the splits should be redistributed
>>>>>>>>>>>>>>> across readers
>>>>>>>>>>>>>>> > according to the current partitioner strategy.
>>>>>>>>>>>>>>> > 2. Reader-level restart: the splits should be reassigned
>>>>>>>>>>>>>>> directly back to
>>>>>>>>>>>>>>> > the same reader they were originally assigned to,
>>>>>>>>>>>>>>> preserving locality and
>>>>>>>>>>>>>>> > avoiding unnecessary redistribution
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > Therefore, the enumerator must clearly distinguish between
>>>>>>>>>>>>>>> > these two scenarios. I originally deprecated the former
>>>>>>>>>>>>>>> > addSplitsBack(List<SplitT> splits, int subtaskId) and added a
>>>>>>>>>>>>>>> > new addSplitsBack(List<SplitT> splits, int subtaskId,
>>>>>>>>>>>>>>> > boolean reportedByReader).
>>>>>>>>>>>>>>> > Leonard suggested using another method, addSplitsBackOnRecovery,
>>>>>>>>>>>>>>> > which does not influence the current addSplitsBack.
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > Best
>>>>>>>>>>>>>>> > Hongshun
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > On 2025/09/08 17:20:31 Becket Qin wrote:
>>>>>>>>>>>>>>> > > Hi Leonard,
>>>>>>>>>>>>>>> > >
>>>>>>>>>>>>>>> > >
>>>>>>>>>>>>>>> > > > Could we introduce a new method like
>>>>>>>>>>>>>>> addSplitsBackOnRecovery  with
>>>>>>>>>>>>>>> > default
>>>>>>>>>>>>>>> > > > implementation. In this way, we can provide better
>>>>>>>>>>>>>>> backward
>>>>>>>>>>>>>>> > compatibility
>>>>>>>>>>>>>>> > > > and also makes it easier for developers to understand.
>>>>>>>>>>>>>>> > >
>>>>>>>>>>>>>>> > >
>>>>>>>>>>>>>>> > > I am curious what would the enumerator do differently
>>>>>>>>>>>>>>> for the splits
>>>>>>>>>>>>>>> > added
>>>>>>>>>>>>>>> > > via addSplitsBackOnRecovery() V.S. addSplitsBack()?
>>>>>>>>>>>>>>> Today,
>>>>>>>>>>>>>>> > addSplitsBack()
>>>>>>>>>>>>>>> > > is also only called upon recovery. So the new method
>>>>>>>>>>>>>>> seems confusing. One
>>>>>>>>>>>>>>> > > thing worth clarifying is if the Source implements
>>>>>>>>>>>>>>> > > SupportSplitReassignmentOnRecovery, upon recovery,
>>>>>>>>>>>>>>> should the splits
>>>>>>>>>>>>>>> > > reported by the readers also be added back to the
>>>>>>>>>>>>>>> SplitEnumerator via the
>>>>>>>>>>>>>>> > > addSplitsBack() call? Or should the SplitEnumerator
>>>>>>>>>>>>>>> explicitly query the
>>>>>>>>>>>>>>> > > registered reader information via the
>>>>>>>>>>>>>>> SplitEnumeratorContext to get the
>>>>>>>>>>>>>>> > > originally assigned splits when addReader() is invoked?
>>>>>>>>>>>>>>> I was assuming
>>>>>>>>>>>>>>> > the
>>>>>>>>>>>>>>> > > latter in the beginning, so the behavior of
>>>>>>>>>>>>>>> addSplitsBack() remains
>>>>>>>>>>>>>>> > > unchanged, but I am not opposed to doing the former.
>>>>>>>>>>>>>>> > >
>>>>>>>>>>>>>>> > > Also, can you elaborate on the backwards compatibility
>>>>>>>>>>>>>>> issue you see if
>>>>>>>>>>>>>>> > we
>>>>>>>>>>>>>>> > > do not have a separate addSplitsBackOnRecovery() method?
>>>>>>>>>>>>>>> Even without
>>>>>>>>>>>>>>> > this
>>>>>>>>>>>>>>> > > new method, the behavior remains exactly the same unless
>>>>>>>>>>>>>>> the end users
>>>>>>>>>>>>>>> > > implement the mix-in interface of
>>>>>>>>>>>>>>> "SupportSplitReassignmentOnRecovery",
>>>>>>>>>>>>>>> > > right?
>>>>>>>>>>>>>>> > >
>>>>>>>>>>>>>>> > > Thanks,
>>>>>>>>>>>>>>> > >
>>>>>>>>>>>>>>> > > Jiangjie (Becket) Qin
>>>>>>>>>>>>>>> > >
>>>>>>>>>>>>>>> > > On Mon, Sep 8, 2025 at 1:48 AM Hongshun Wang <
>>>>>>>>>>>>>>> [email protected]>
>>>>>>>>>>>>>>> > > wrote:
>>>>>>>>>>>>>>> > >
>>>>>>>>>>>>>>> > > > Hi devs,
>>>>>>>>>>>>>>> > > >
>>>>>>>>>>>>>>> > > > It has been quite some time since this FLIP[1] was
>>>>>>>>>>>>>>> first proposed.
>>>>>>>>>>>>>>> > Thank
>>>>>>>>>>>>>>> > > > you for your valuable feedback—based on your
>>>>>>>>>>>>>>> suggestions, the FLIP has
>>>>>>>>>>>>>>> > > > undergone several rounds of revisions.
>>>>>>>>>>>>>>> > > >
>>>>>>>>>>>>>>> > > > Any more advice is welcome and appreciated. If there
>>>>>>>>>>>>>>> are no further
>>>>>>>>>>>>>>> > > > concerns, I plan to start the vote tomorrow.
>>>>>>>>>>>>>>> > > >
>>>>>>>>>>>>>>> > > > Best
>>>>>>>>>>>>>>> > > > Hongshun
>>>>>>>>>>>>>>> > > >
>>>>>>>>>>>>>>> > > > [1]
>>>>>>>>>>>>>>> > > >
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=373886480
>>>>>>>>>>>>>>> > > >
>>>>>>>>>>>>>>> > > > On Mon, Sep 8, 2025 at 4:42 PM Hongshun Wang <
>>>>>>>>>>>>>>> [email protected]>
>>>>>>>>>>>>>>> > > > wrote:
>>>>>>>>>>>>>>> > > >
>>>>>>>>>>>>>>> > > > > Hi Leonard,
>>>>>>>>>>>>>>> > > > > Thanks for your advice.  It makes sense and I have
>>>>>>>>>>>>>>> modified it.
>>>>>>>>>>>>>>> > > > >
>>>>>>>>>>>>>>> > > > > Best,
>>>>>>>>>>>>>>> > > > > Hongshun
>>>>>>>>>>>>>>> > > > >
>>>>>>>>>>>>>>> > > > > On Mon, Sep 8, 2025 at 11:40 AM Leonard Xu <
>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>> > > > >
>>>>>>>>>>>>>>> > > > >> Thanks Hongshun and Becket for the deep discussion.
>>>>>>>>>>>>>>> > > > >>
>>>>>>>>>>>>>>> > > > >> I only have one comment for one API design:
>>>>>>>>>>>>>>> > > > >>
>>>>>>>>>>>>>>> > > > >> > Deprecate the old addSplitsBack  method, use a
>>>>>>>>>>>>>>> addSplitsBack with
>>>>>>>>>>>>>>> > > > >> param isReportedByReader instead. Because, The
>>>>>>>>>>>>>>> enumerator can apply
>>>>>>>>>>>>>>> > > > >> different reassignment policies based on the
>>>>>>>>>>>>>>> context.
>>>>>>>>>>>>>>> > > > >>
>>>>>>>>>>>>>>> > > > >> Could we introduce a new method like
>>>>>>>>>>>>>>> *addSplitsBackOnRecovery*  with
>>>>>>>>>>>>>>> > > > default
>>>>>>>>>>>>>>> > > > >> implementation. In this way, we can provide better
>>>>>>>>>>>>>>> backward
>>>>>>>>>>>>>>> > > > >> compatibility and also make it easier for
>>>>>>>>>>>>>>> developers to understand.
>>>>>>>>>>>>>>> > > > >>
>>>>>>>>>>>>>>> > > > >> Best,
>>>>>>>>>>>>>>> > > > >> Leonard
>>>>>>>>>>>>>>> > > > >>
>>>>>>>>>>>>>>> > > > >>
>>>>>>>>>>>>>>> > > > >>
>>>>>>>>>>>>>>> > > > >> On Sep 3, 2025 at 20:26, Hongshun Wang <[email protected]> wrote:
>>>>>>>>>>>>>>> > > > >>
>>>>>>>>>>>>>>> > > > >> Hi Becket,
>>>>>>>>>>>>>>> > > > >>
>>>>>>>>>>>>>>> > > > >> I think that's a great idea! I have added the
>>>>>>>>>>>>>>> > > > >> SupportSplitReassignmentOnRecovery interface in this
>>>>>>>>>>>>>>> > > > >> FLIP. A Source implementing this interface indicates
>>>>>>>>>>>>>>> > > > >> that the source operator needs to report splits to the
>>>>>>>>>>>>>>> > > > >> enumerator and receive reassignment. [1]
>>>>>>>>>>>>>>> > > > >>
>>>>>>>>>>>>>>> > > > >> Best,
>>>>>>>>>>>>>>> > > > >> Hongshun
>>>>>>>>>>>>>>> > > > >>
>>>>>>>>>>>>>>> > > > >> [1]
>>>>>>>>>>>>>>> > > > >>
>>>>>>>>>>>>>>> > > >
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-537%3A+Enumerator+with+Global+Split+Assignment+Distribution+for+Balanced+Split+assignment
>>>>>>>>>>>>>>> > > > >>
>>>>>>>>>>>>>>> > > > >> On Thu, Aug 21, 2025 at 12:09 PM Becket Qin <
>>>>>>>>>>>>>>> [email protected]>
>>>>>>>>>>>>>>> > > > wrote:
>>>>>>>>>>>>>>> > > > >>
>>>>>>>>>>>>>>> > > > >>> Hi Hongshun,
>>>>>>>>>>>>>>> > > > >>>
>>>>>>>>>>>>>>> > > > >>> I think the convention for such optional features
>>>>>>>>>>>>>>> in Source is via
>>>>>>>>>>>>>>> > > > >>> mix-in interfaces. So instead of adding a method
>>>>>>>>>>>>>>> to the
>>>>>>>>>>>>>>> > SourceReader,
>>>>>>>>>>>>>>> > > > maybe
>>>>>>>>>>>>>>> > > > >>> we should introduce an interface
>>>>>>>>>>>>>>> SupportSplitReassingmentOnRecovery
>>>>>>>>>>>>>>> > > > with
>>>>>>>>>>>>>>> > > > >>> this method. If a Source implementation implements
>>>>>>>>>>>>>>> that interface,
>>>>>>>>>>>>>>> > > > then the
>>>>>>>>>>>>>>> > > > >>> SourceOperator will check the desired behavior and
>>>>>>>>>>>>>>> act accordingly.
>>>>>>>>>>>>>>> > > > >>>
>>>>>>>>>>>>>>> > > > >>> Thanks,
>>>>>>>>>>>>>>> > > > >>>
>>>>>>>>>>>>>>> > > > >>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>>> > > > >>>
>>>>>>>>>>>>>>> > > > >>> On Wed, Aug 20, 2025 at 8:52 PM Hongshun Wang <
>>>>>>>>>>>>>>> > [email protected]
>>>>>>>>>>>>>>> > > > >
>>>>>>>>>>>>>>> > > > >>> wrote:
>>>>>>>>>>>>>>> > > > >>>
>>>>>>>>>>>>>>> > > > >>>> Hi devs,
>>>>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>>>>> > > > >>>> Would anyone like to discuss this FLIP? I'd
>>>>>>>>>>>>>>> appreciate your
>>>>>>>>>>>>>>> > feedback
>>>>>>>>>>>>>>> > > > >>>> and suggestions.
>>>>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>>>>> > > > >>>> Best,
>>>>>>>>>>>>>>> > > > >>>> Hongshun
>>>>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>>>>> > > > >>>> On Aug 13, 2025 at 14:23, Hongshun Wang <[email protected]>
>>>>>>>>>>>>>>> > > > >>>> wrote:
>>>>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>>>>> > > > >>>> Hi Becket,
>>>>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>>>>> > > > >>>> Thank you for your detailed feedback. The new
>>>>>>>>>>>>>>> contract makes good
>>>>>>>>>>>>>>> > > > sense
>>>>>>>>>>>>>>> > > > >>>> to me and effectively addresses the issues I
>>>>>>>>>>>>>>> encountered at the
>>>>>>>>>>>>>>> > > > beginning
>>>>>>>>>>>>>>> > > > >>>> of the design.
>>>>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>>>>> > > > >>>> That said, I recommend not reporting splits by
>>>>>>>>>>>>>>> default, primarily
>>>>>>>>>>>>>>> > for
>>>>>>>>>>>>>>> > > > >>>> compatibility and practical reasons:
>>>>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>>>>> > > > >>>> >  For these reasons, we do not expect the Split
>>>>>>>>>>>>>>> objects to be
>>>>>>>>>>>>>>> > huge,
>>>>>>>>>>>>>>> > > > >>>> and we are not trying to design for huge Split
>>>>>>>>>>>>>>> objects either as
>>>>>>>>>>>>>>> > they
>>>>>>>>>>>>>>> > > > will
>>>>>>>>>>>>>>> > > > >>>> have problems even today.
>>>>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>>>>> > > > >>>>    1. Not all existing connectors match this rule.
>>>>>>>>>>>>>>> > > > >>>>    For example, in the MySQL CDC connector, a binlog
>>>>>>>>>>>>>>> > > > >>>>    split may contain hundreds (or even more) snapshot
>>>>>>>>>>>>>>> > > > >>>>    split completion records. This state is large and
>>>>>>>>>>>>>>> > > > >>>>    is currently transmitted incrementally through
>>>>>>>>>>>>>>> > > > >>>>    multiple BinlogSplitMetaEvent messages. Since the
>>>>>>>>>>>>>>> > > > >>>>    binlog reader operates with single parallelism,
>>>>>>>>>>>>>>> > > > >>>>    reporting the full split state on recovery could be
>>>>>>>>>>>>>>> > > > >>>>    inefficient or even infeasible. For such sources,
>>>>>>>>>>>>>>> > > > >>>>    it would be better to provide a mechanism to skip
>>>>>>>>>>>>>>> > > > >>>>    split reporting during restart until they redesign
>>>>>>>>>>>>>>> > > > >>>>    and reduce the split size.
>>>>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>>>>> > > > >>>>    2. Not all enumerators maintain unassigned splits
>>>>>>>>>>>>>>> > > > >>>>    in state. Some SplitEnumerator implementations
>>>>>>>>>>>>>>> > > > >>>>    (such as the Kafka connector's) do not track or
>>>>>>>>>>>>>>> > > > >>>>    persistently manage unassigned splits. Requiring
>>>>>>>>>>>>>>> > > > >>>>    them to handle re-registration would add
>>>>>>>>>>>>>>> > > > >>>>    unnecessary complexity. Even though we may
>>>>>>>>>>>>>>> > > > >>>>    implement it in the Kafka connector, the Kafka
>>>>>>>>>>>>>>> > > > >>>>    connector is currently decoupled from the Flink
>>>>>>>>>>>>>>> > > > >>>>    version, so we also need to make sure the older
>>>>>>>>>>>>>>> > > > >>>>    versions remain compatible.
>>>>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>>>>> > > > >>>> ------------------------------
>>>>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>>>>> > > > >>>> To address these concerns, I propose introducing
>>>>>>>>>>>>>>> a new method:
>>>>>>>>>>>>>>> > boolean
>>>>>>>>>>>>>>> > > > >>>> SourceReader#shouldReassignSplitsOnRecovery()
>>>>>>>>>>>>>>> with a default
>>>>>>>>>>>>>>> > > > >>>> implementation returning false. This allows
>>>>>>>>>>>>>>> source readers to opt
>>>>>>>>>>>>>>> > in
>>>>>>>>>>>>>>> > > > >>>> to split reassignment only when necessary. Since
>>>>>>>>>>>>>>> the new contract
>>>>>>>>>>>>>>> > > > already
>>>>>>>>>>>>>>> > > > >>>> places the responsibility for split assignment on
>>>>>>>>>>>>>>> the enumerator,
>>>>>>>>>>>>>>> > not
>>>>>>>>>>>>>>> > > > >>>> reporting splits by default is a safe and clean
>>>>>>>>>>>>>>> default behavior.
>>>>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>>>>> > > > >>>> ------------------------------
>>>>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>>>>> > > > >>>> I've updated the implementation and the FLIP
>>>>>>>>>>>>>>> > > > >>>> accordingly [1]. It is quite a big change. In
>>>>>>>>>>>>>>> > > > >>>> particular, for the Kafka
>>>>>>>>>>>>>>> connector, we can now use
>>>>>>>>>>>>>>> > a
>>>>>>>>>>>>>>> > > > >>>> pluggable SplitPartitioner to support different
>>>>>>>>>>>>>>> split assignment
>>>>>>>>>>>>>>> > > > >>>> strategies (e.g., default, round-robin).
>>>>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>>>>> > > > >>>> Could you please review it when you have a chance?
>>>>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>>>>> > > > >>>> Best,
>>>>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>>>>> > > > >>>> Hongshun
>>>>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>>>>> > > > >>>> [1]
>>>>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>>>>> > > >
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-537%3A+Enumerator+with+Global+Split+Assignment+Distribution+for+Balanced+Split+assignment
>>>>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>>>>> > > > >>>> On Sat, Aug 9, 2025 at 3:03 AM Becket Qin <
>>>>>>>>>>>>>>> [email protected]>
>>>>>>>>>>>>>>> > > > wrote:
>>>>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>>>>> > > > >>>>> Hi Hongshun,
>>>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>>>> > > > >>>>> I am not too concerned about the transmission
>>>>>>>>>>>>>>> > > > >>>>> cost, because the full split transmission has to
>>>>>>>>>>>>>>> > > > >>>>> happen in the initial assignment phase already.
>>>>>>>>>>>>>>> > > > >>>>> And in the future, we probably want to also
>>>>>>>>>>>>>>> introduce some kind
>>>>>>>>>>>>>>> > of
>>>>>>>>>>>>>>> > > > workload
>>>>>>>>>>>>>>> > > > >>>>> balance across source readers, e.g. based on the
>>>>>>>>>>>>>>> per-split
>>>>>>>>>>>>>>> > > > throughput or
>>>>>>>>>>>>>>> > > > >>>>> the per-source-reader workload in heterogeneous
>>>>>>>>>>>>>>> clusters. For
>>>>>>>>>>>>>>> > these
>>>>>>>>>>>>>>> > > > >>>>> reasons, we do not expect the Split objects to
>>>>>>>>>>>>>>> be huge, and we
>>>>>>>>>>>>>>> > are
>>>>>>>>>>>>>>> > > > not
>>>>>>>>>>>>>>> > > > >>>>> trying to design for huge Split objects either
>>>>>>>>>>>>>>> as they will have
>>>>>>>>>>>>>>> > > > problems
>>>>>>>>>>>>>>> > > > >>>>> even today.
>>>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>>>> > > > >>>>> Good point on the potential split loss, please
>>>>>>>>>>>>>>> see the reply
>>>>>>>>>>>>>>> > below:
>>>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>>>> > > > >>>>> Scenario 2:
>>>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>>>> > > > >>>>>> 1. Reader A reports splits (1 and 2), and
>>>>>>>>>>>>>>> Reader B reports (3
>>>>>>>>>>>>>>> > and 4)
>>>>>>>>>>>>>>> > > > >>>>>> upon restart.
>>>>>>>>>>>>>>> > > > >>>>>> 2. Before the enumerator receives all reports
>>>>>>>>>>>>>>> and performs
>>>>>>>>>>>>>>> > > > >>>>>> reassignment, a checkpoint is triggered.
>>>>>>>>>>>>>>> > > > >>>>>> 3. Since no splits have been reassigned yet,
>>>>>>>>>>>>>>> both readers have
>>>>>>>>>>>>>>> > empty
>>>>>>>>>>>>>>> > > > >>>>>> states.
>>>>>>>>>>>>>>> > > > >>>>>> 4. When restarting from this checkpoint, all
>>>>>>>>>>>>>>> four splits are
>>>>>>>>>>>>>>> > lost.
>>>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>>>> > > > >>>>> The reader registration happens in the
>>>>>>>>>>>>>>> SourceOperator.open(),
>>>>>>>>>>>>>>> > which
>>>>>>>>>>>>>>> > > > >>>>> means the task is still in the initializing
>>>>>>>>>>>>>>> state, therefore the
>>>>>>>>>>>>>>> > > > checkpoint
>>>>>>>>>>>>>>> > > > >>>>> should not be triggered until the enumerator
>>>>>>>>>>>>>>> receives all the
>>>>>>>>>>>>>>> > split
>>>>>>>>>>>>>>> > > > reports.
>>>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>>>> > > > >>>>> There is a nuance here. Today, the RPC call from
>>>>>>>>>>>>>>> the TM to the JM
>>>>>>>>>>>>>>> > is
>>>>>>>>>>>>>>> > > > >>>>> async. So it is possible that the
>>>>>>>>>>>>>>> SourceOpertor.open() has
>>>>>>>>>>>>>>> > returned,
>>>>>>>>>>>>>>> > > > but
>>>>>>>>>>>>>>> > > > >>>>> the enumerator has not received the split
>>>>>>>>>>>>>>> reports. However,
>>>>>>>>>>>>>>> > because
>>>>>>>>>>>>>>> > > > the
>>>>>>>>>>>>>>> > > > >>>>> task status update RPC call goes to the same
>>>>>>>>>>>>>>> channel as the split
>>>>>>>>>>>>>>> > > > reports
>>>>>>>>>>>>>>> > > > >>>>> call, so the task status RPC call will happen
>>>>>>>>>>>>>>> after the split
>>>>>>>>>>>>>>> > > > reports call
>>>>>>>>>>>>>>> > > > >>>>> on the JM side. Therefore, on the JM side, the
>>>>>>>>>>>>>>> SourceCoordinator
>>>>>>>>>>>>>>> > will
>>>>>>>>>>>>>>> > > > >>>>> always first receive the split reports, then
>>>>>>>>>>>>>>> receive the
>>>>>>>>>>>>>>> > checkpoint
>>>>>>>>>>>>>>> > > > request.
>>>>>>>>>>>>>>> > > > >>>>> This "happen before" relationship is kind of
>>>>>>>>>>>>>>> important to
>>>>>>>>>>>>>>> > guarantee
>>>>>>>>>>>>>>> > > > >>>>> the consistent state between enumerator and
>>>>>>>>>>>>>>> readers.
>>>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>>>> > > > >>>>> Scenario 1:
>>>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>>>> > > > >>>>>> 1. Upon restart, Reader A reports assigned
>>>>>>>>>>>>>>> splits (1 and 2), and
>>>>>>>>>>>>>>> > > > >>>>>> Reader B reports (3 and 4).
>>>>>>>>>>>>>>> > > > >>>>>> 2. The enumerator receives these reports but
>>>>>>>>>>>>>>> only reassigns
>>>>>>>>>>>>>>> > splits 1
>>>>>>>>>>>>>>> > > > >>>>>> and 2 — not 3 and 4.
>>>>>>>>>>>>>>> > > > >>>>>> 3. A checkpoint or savepoint is then triggered.
>>>>>>>>>>>>>>> Only splits 1
>>>>>>>>>>>>>>> > and 2
>>>>>>>>>>>>>>> > > > >>>>>> are recorded in the reader states; splits 3 and
>>>>>>>>>>>>>>> 4 are not
>>>>>>>>>>>>>>> > persisted.
>>>>>>>>>>>>>>> > > > >>>>>> 4. If the job is later restarted from this
>>>>>>>>>>>>>>> checkpoint, splits 3
>>>>>>>>>>>>>>> > and
>>>>>>>>>>>>>>> > > > 4
>>>>>>>>>>>>>>> > > > >>>>>> will be permanently lost.
>>>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>>>> > > > >>>>> This scenario is possible. One solution is to
>>>>>>>>>>>>>>> let the enumerator
>>>>>>>>>>>>>>> > > > >>>>> implementation handle this. That means if the
>>>>>>>>>>>>>>> enumerator relies
>>>>>>>>>>>>>>> > on
>>>>>>>>>>>>>>> > > > the
>>>>>>>>>>>>>>> > > > >>>>> initial split reports from the source readers,
>>>>>>>>>>>>>>> it should maintain
>>>>>>>>>>>>>>> > > > these
>>>>>>>>>>>>>>> > > > >>>>> reports by itself. In the above example, the
>>>>>>>>>>>>>>> enumerator will need
>>>>>>>>>>>>>>> > to
>>>>>>>>>>>>>>> > > > >>>>> remember that 3 and 4 are not assigned and put
>>>>>>>>>>>>>>> > > > >>>>> them into its own state.
>>>>>>>>>>>>>>> > > > >>>>> The current contract is that anything assigned
>>>>>>>>>>>>>>> to the
>>>>>>>>>>>>>>> > SourceReaders
>>>>>>>>>>>>>>> > > > >>>>> is completely owned by the SourceReaders.
>>>>>>>>>>>>>>> Enumerators can
>>>>>>>>>>>>>>> > remember
>>>>>>>>>>>>>>> > > > the
>>>>>>>>>>>>>>> > > > >>>>> assignments but cannot change them, even when
>>>>>>>>>>>>>>> the source reader
>>>>>>>>>>>>>>> > > > recovers /
>>>>>>>>>>>>>>> > > > >>>>> restarts.
>>>>>>>>>>>>>>> > > > >>>>> With this FLIP, the contract becomes that the
>>>>>>>>>>>>>>> source readers will
>>>>>>>>>>>>>>> > > > >>>>> return the ownership of the splits to the
>>>>>>>>>>>>>>> enumerator. So the
>>>>>>>>>>>>>>> > > > enumerator is
>>>>>>>>>>>>>>> > > > >>>>> responsible for maintaining these splits, until
>>>>>>>>>>>>>>> they are assigned
>>>>>>>>>>>>>>> > to
>>>>>>>>>>>>>>> > > > a
>>>>>>>>>>>>>>> > > > >>>>> source reader again.
>>>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>>>> > > > >>>>> There are other cases where there may be
>>>>>>>>>>>>>>> > > > >>>>> conflicting information between
>>>>>>>>>>>>>>> > > > >>>>> reader and enumerator. For example, consider the
>>>>>>>>>>>>>>> following
>>>>>>>>>>>>>>> > sequence:
>>>>>>>>>>>>>>> > > > >>>>> 1. reader A reports splits (1 and 2) upon
>>>>>>>>>>>>>>> > > > >>>>> restart.
>>>>>>>>>>>>>>> > > > >>>>> 2. enumerator receives the report and assigns
>>>>>>>>>>>>>>> both 1 and 2 to
>>>>>>>>>>>>>>> > reader
>>>>>>>>>>>>>>> > > > B.
>>>>>>>>>>>>>>> > > > >>>>> 3. reader A failed before checkpointing. And
>>>>>>>>>>>>>>> this is a partial
>>>>>>>>>>>>>>> > > > >>>>> failure, so only reader A restarts.
>>>>>>>>>>>>>>> > > > >>>>> 4. When reader A recovers, it will again report
>>>>>>>>>>>>>>> splits (1 and 2)
>>>>>>>>>>>>>>> > to
>>>>>>>>>>>>>>> > > > >>>>> the enumerator.
>>>>>>>>>>>>>>> > > > >>>>> 5. The enumerator should ignore this report
>>>>>>>>>>>>>>> because it has
>>>>>>>>>>>>>>> > > > >>>>> assigned splits (1 and 2) to reader B.
>>>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>>>> > > > >>>>> So with the new contract, the enumerator should
>>>>>>>>>>>>>>> be the source of
>>>>>>>>>>>>>>> > > > truth
>>>>>>>>>>>>>>> > > > >>>>> for split ownership.
>>>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>>>> > > > >>>>> Thanks,
>>>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>>>> > > > >>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>>>> > > > >>>>> On Fri, Aug 8, 2025 at 12:58 AM Hongshun Wang <
>>>>>>>>>>>>>>> > > > [email protected]>
>>>>>>>>>>>>>>> > > > >>>>> wrote:
>>>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>>>> > > > >>>>>> Hi Becket,
>>>>>>>>>>>>>>> > > > >>>>>>
>>>>>>>>>>>>>>> > > > >>>>>>
>>>>>>>>>>>>>>> > > > >>>>>> I did consider this approach at the beginning
>>>>>>>>>>>>>>> (and it was also
>>>>>>>>>>>>>>> > > > >>>>>> mentioned in this FLIP), since it would allow
>>>>>>>>>>>>>>> more flexibility
>>>>>>>>>>>>>>> > in
>>>>>>>>>>>>>>> > > > >>>>>> reassigning all splits. However, there are a
>>>>>>>>>>>>>>> few potential
>>>>>>>>>>>>>>> > issues.
>>>>>>>>>>>>>>> > > > >>>>>>
>>>>>>>>>>>>>>> > > > >>>>>> 1. High Transmission Cost
>>>>>>>>>>>>>>> > > > >>>>>> If we pass the full split objects (rather than
>>>>>>>>>>>>>>> just split IDs),
>>>>>>>>>>>>>>> > the
>>>>>>>>>>>>>>> > > > >>>>>> data size could be significant, leading to high
>>>>>>>>>>>>>>> overhead during
>>>>>>>>>>>>>>> > > > >>>>>> transmission — especially when many splits are
>>>>>>>>>>>>>>> involved.
>>>>>>>>>>>>>>> > > > >>>>>>
>>>>>>>>>>>>>>> > > > >>>>>> 2. Risk of Split Loss
>>>>>>>>>>>>>>> > > > >>>>>>
>>>>>>>>>>>>>>> > > > >>>>>> Risk of split loss exists unless we have a
>>>>>>>>>>>>>>> mechanism to make
>>>>>>>>>>>>>>> > sure
>>>>>>>>>>>>>>> > > > >>>>>> we can only checkpoint after all the splits are
>>>>>>>>>>>>>>> reassigned.
>>>>>>>>>>>>>>> > > > >>>>>> There are scenarios where splits could be lost
>>>>>>>>>>>>>>> due to
>>>>>>>>>>>>>>> > inconsistent
>>>>>>>>>>>>>>> > > > >>>>>> state handling during recovery:
>>>>>>>>>>>>>>> > > > >>>>>>
>>>>>>>>>>>>>>> > > > >>>>>>
>>>>>>>>>>>>>>> > > > >>>>>> Scenario 1:
>>>>>>>>>>>>>>> > > > >>>>>>
>>>>>>>>>>>>>>> > > > >>>>>>
>>>>>>>>>>>>>>> > > > >>>>>>    1. Upon restart, Reader A reports assigned
>>>>>>>>>>>>>>> splits (1 and 2),
>>>>>>>>>>>>>>> > and
>>>>>>>>>>>>>>> > > > >>>>>>    Reader B reports (3 and 4).
>>>>>>>>>>>>>>> > > > >>>>>>    2. The enumerator receives these reports but
>>>>>>>>>>>>>>> only reassigns
>>>>>>>>>>>>>>> > > > >>>>>>    splits 1 and 2 — not 3 and 4.
>>>>>>>>>>>>>>> > > > >>>>>>    3. A checkpoint or savepoint is then
>>>>>>>>>>>>>>> triggered. Only splits 1
>>>>>>>>>>>>>>> > and
>>>>>>>>>>>>>>> > > > >>>>>>    2 are recorded in the reader states; splits
>>>>>>>>>>>>>>> 3 and 4 are not
>>>>>>>>>>>>>>> > > > persisted.
>>>>>>>>>>>>>>> > > > >>>>>>    4. If the job is later restarted from this
>>>>>>>>>>>>>>> checkpoint, splits
>>>>>>>>>>>>>>> > 3
>>>>>>>>>>>>>>> > > > >>>>>>    and 4 will be permanently lost.
>>>>>>>>>>>>>>> > > > >>>>>>
>>>>>>>>>>>>>>> > > > >>>>>>
>>>>>>>>>>>>>>> > > > >>>>>> Scenario 2:
>>>>>>>>>>>>>>> > > > >>>>>>
>>>>>>>>>>>>>>> > > > >>>>>>    1. Reader A reports splits (1 and 2), and
>>>>>>>>>>>>>>> Reader B reports (3
>>>>>>>>>>>>>>> > and
>>>>>>>>>>>>>>> > > > >>>>>>    4) upon restart.
>>>>>>>>>>>>>>> > > > >>>>>>    2. Before the enumerator receives all
>>>>>>>>>>>>>>> reports and performs
>>>>>>>>>>>>>>> > > > >>>>>>    reassignment, a checkpoint is triggered.
>>>>>>>>>>>>>>> > > > >>>>>>    3. Since no splits have been reassigned yet,
>>>>>>>>>>>>>>> both readers
>>>>>>>>>>>>>>> > have
>>>>>>>>>>>>>>> > > > >>>>>>    empty states.
>>>>>>>>>>>>>>> > > > >>>>>>    4. When restarting from this checkpoint, all
>>>>>>>>>>>>>>> four splits are
>>>>>>>>>>>>>>> > > > lost.
>>>>>>>>>>>>>>> > > > >>>>>>
>>>>>>>>>>>>>>> > > > >>>>>>
>>>>>>>>>>>>>>> > > > >>>>>> Let me know if you have thoughts on how we
>>>>>>>>>>>>>>> might mitigate these
>>>>>>>>>>>>>>> > > > risks!
>>>>>>>>>>>>>>> > > > >>>>>>
>>>>>>>>>>>>>>> > > > >>>>>> Best
>>>>>>>>>>>>>>> > > > >>>>>> Hongshun
>>>>>>>>>>>>>>> > > > >>>>>>
>>>>>>>>>>>>>>> > > > >>>>>> On Fri, Aug 8, 2025 at 1:46 AM Becket Qin <
>>>>>>>>>>>>>>> [email protected]>
>>>>>>>>>>>>>>> > > > >>>>>> wrote:
>>>>>>>>>>>>>>> > > > >>>>>>
>>>>>>>>>>>>>>> > > > >>>>>>> Hi Hongshun,
>>>>>>>>>>>>>>> > > > >>>>>>>
>>>>>>>>>>>>>>> > > > >>>>>>> The steps sound reasonable to me in general.
>>>>>>>>>>>>>>> In terms of the
>>>>>>>>>>>>>>> > > > updated
>>>>>>>>>>>>>>> > > > >>>>>>> FLIP wiki, it would be good to see if we can
>>>>>>>>>>>>>>> keep the protocol
>>>>>>>>>>>>>>> > > > simple. One
>>>>>>>>>>>>>>> > > > >>>>>>> alternative way to achieve this behavior is
>>>>>>>>>>>>>>> the following:
>>>>>>>>>>>>>>> > > > >>>>>>>
>>>>>>>>>>>>>>> > > > >>>>>>> 1. Upon SourceOperator startup, the
>>>>>>>>>>>>>>> SourceOperator sends
>>>>>>>>>>>>>>> > > > >>>>>>> ReaderRegistrationEvent with the currently
>>>>>>>>>>>>>>> assigned splits to
>>>>>>>>>>>>>>> > the
>>>>>>>>>>>>>>> > > > >>>>>>> enumerator. It does not add these splits to
>>>>>>>>>>>>>>> the SourceReader.
>>>>>>>>>>>>>>> > > > >>>>>>> 2. The enumerator will always use the
>>>>>>>>>>>>>>> > > > >>>>>>> SourceEnumeratorContext.assignSplits() to
>>>>>>>>>>>>>>> assign the splits.
>>>>>>>>>>>>>>> > (not
>>>>>>>>>>>>>>> > > > via the
>>>>>>>>>>>>>>> > > > >>>>>>> response of the SourceRegistrationEvent, this
>>>>>>>>>>>>>>> allows async
>>>>>>>>>>>>>>> > split
>>>>>>>>>>>>>>> > > > assignment
>>>>>>>>>>>>>>> > > > >>>>>>> in case the enumerator wants to wait until all
>>>>>>>>>>>>>>> the readers are
>>>>>>>>>>>>>>> > > > registered)
>>>>>>>>>>>>>>> > > > >>>>>>> 3. The SourceOperator will only call
>>>>>>>>>>>>>>> SourceReader.addSplits()
>>>>>>>>>>>>>>> > when
>>>>>>>>>>>>>>> > > > >>>>>>> it receives the AddSplitEvent from the
>>>>>>>>>>>>>>> enumerator.
>>>>>>>>>>>>>>> > > > >>>>>>>
>>>>>>>>>>>>>>> > > > >>>>>>> This protocol has a few benefits:
>>>>>>>>>>>>>>> > > > >>>>>>> 1. it basically allows arbitrary split
>>>>>>>>>>>>>>> reassignment upon
>>>>>>>>>>>>>>> > restart
>>>>>>>>>>>>>>> > > > >>>>>>> 2. simplicity: there is only one way to assign
>>>>>>>>>>>>>>> splits.
>>>>>>>>>>>>>>> > > > >>>>>>>
>>>>>>>>>>>>>>> > > > >>>>>>> So we only need one interface change:
>>>>>>>>>>>>>>> > > > >>>>>>> - add the initially assigned splits to
>>>>>>>>>>>>>>> ReaderInfo so the
>>>>>>>>>>>>>>> > Enumerator
>>>>>>>>>>>>>>> > > > >>>>>>> can access it.
>>>>>>>>>>>>>>> > > > >>>>>>> and one behavior change:
>>>>>>>>>>>>>>> > > > >>>>>>> - The SourceOperator should stop assigning
>>>>>>>>>>>>>>> > > > >>>>>>> splits to the SourceReader from state
>>>>>>>>>>>>>>> > > > >>>>>>> restoration, but
>>>>>>>>>>>>>>> > [message truncated...]
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
