Hi Becket and Leonard,

It seems adding  `splitsOnRecovery` to `ReaderInfo` makes the split
enumerator simpler and cleaner.
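
For illustration, here is a minimal sketch of what the enriched ReaderInfo
could look like (the field name and accessor are assumptions based on this
thread, not the final FLIP API):

    // imports: java.io.Serializable, java.util.List,
    //          org.apache.flink.api.connector.source.SourceSplit
    // Sketch only: ReaderInfo carrying the splits a reader reported back
    // from its restored state on recovery (empty if it reported none).
    public final class ReaderInfo implements Serializable {
        private final int subtaskId;
        private final String location;
        private final List<SourceSplit> splitsOnRecovery;

        public ReaderInfo(
                int subtaskId, String location, List<SourceSplit> splitsOnRecovery) {
            this.subtaskId = subtaskId;
            this.location = location;
            this.splitsOnRecovery = splitsOnRecovery;
        }

        public List<SourceSplit> getSplitsOnRecovery() {
            return splitsOnRecovery;
        }
    }

With this, the enumerator can read a recovered reader's previous splits
directly from the context in addReader(), without any extra callback.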

I have modified this FLIP again. Please have a look and let me know what
you think.

Best,
Hongshun

On Mon, Oct 13, 2025 at 10:48 AM Hongshun Wang <[email protected]>
wrote:

> Hi Becket,
> Thanks for your explanation.
>
> > For the same three input above, the assignment should be consistently
> the same.
>
> That is exactly what troubles me. For assignment algorithms such as
> hash, it does behave the same. But what if we use round-robin? Even with
> the same reader information, the same split may be assigned to a different
> reader each time. This is also the example I listed before:
>
>    1. *Initial state:* 2 parallelism, 2 splits.
>    2. *Enumerator action:* Split 1 → Task 1, Split 2 → Task 2.
>    3. *Failure scenario:* After Split 2 is assigned to Task 2 but before
>    the next checkpoint succeeds, Task 2 restarts.
>    4. *Recovery issue:* Split 2 is re-added to the enumerator. The
>    round-robin strategy assigns Split 2 to Task 1. Task 1 now has 2
>    splits, Task 2 has 0 → imbalanced distribution (see the sketch below).
>
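> A tiny sketch of that failure mode with a naive round-robin assigner
> (hypothetical code, just to illustrate the imbalance):
>
>     // Naive round-robin: ignores prior ownership, so a split returned
>     // after a partial failover can land on a different reader.
>     int next = 0;
>     int assign(int parallelism) {
>         int reader = next;
>         next = (next + 1) % parallelism;
>         return reader;
>     }
>     // Initial run: split 1 -> task 1, split 2 -> task 2.
>     // Task 2 restarts before the checkpoint succeeds, split 2 comes back:
>     // assign(2) now returns task 1, so task 1 holds 2 splits, task 2 holds 0.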
>
> > Please let me know if you think a meeting would be more efficient.
> Yes, I’d like to reach an agreement as soon as possible. If you’re
> available, we could schedule a meeting with Leonard as well.
>
> Best,
> Hongshun
>
> On Sat, Oct 11, 2025 at 3:59 PM Becket Qin <[email protected]> wrote:
>
>> Hi Hongshun,
>>
>> I am confused. First of all, regardless of what the assignment algorithm
>> is, using SplitEnumeratorContext to return the splits only gives more
>> information than using addSplitsBack(). So there should be no regression.
>>
>> Secondly, at this point the SplitEnumerator should only take the
>> following three inputs to generate the global splits assignment:
>> 1. the *reader information (num readers, locations, etc.)*
>> 2. *all the splits to assign*
>> 3. *the configured assignment algorithm*
>> Preferably, for the same three inputs above, the assignment should be
>> consistently the same. I don't see why it should care about why a new
>> reader is added, whether due to partial failover, global failover, or job
>> restart.
>>
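>> As a sketch, that determinism contract could be expressed as a pure
>> function of exactly those three inputs (hypothetical signature, not an
>> existing Flink API):
>>
>>     // Same readers + same splits + same algorithm => same assignment.
>>     Map<Integer, List<SplitT>> computeAssignment(
>>             Collection<ReaderInfo> readers,
>>             Collection<SplitT> splitsToAssign,
>>             AssignmentAlgorithm<SplitT> algorithm) {
>>         return algorithm.assign(readers, splitsToAssign);
>>     }
>>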
>> If you want to do global redistribution on global failover and restart,
>> but honor the existing assignment for partial failover, the enumerator will
>> just do the following:
>> 1. Generate a new global assignment (global redistribution) in start()
>> because start() will only be invoked in global failover or restart. That
>> means all the readers are also new with empty assignment.
>> 2. After the global assignment is generated, it should be honored for the
>> whole life cycle. There might be many reader registrations, again for
>> different reasons, but that does not matter:
>>     - reader registration after this job restart
>>     - reader registration after this global failover
>>     - reader registration due to partial failover, which may or may not
>> have an addSplitsBack() call.
>>     Regardless of the reason, the split enumerator will just enforce the
>> global assignment it has already generated, i.e. without split
>> redistribution.
>>
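>> A minimal sketch of that contract (the helper names are assumptions, not
>> a definitive implementation):
>>
>>     public class MyEnumerator implements SplitEnumerator<MySplit, MyCheckpoint> {
>>         private final SplitEnumeratorContext<MySplit> context;
>>         private Map<Integer, List<MySplit>> globalAssignment; // subtask -> splits
>>
>>         MyEnumerator(SplitEnumeratorContext<MySplit> context) {
>>             this.context = context;
>>         }
>>
>>         @Override
>>         public void start() {
>>             // Only runs on restart / global failover, where all readers
>>             // come back with empty assignments: redistribute globally, once.
>>             globalAssignment = computeGlobalAssignment(discoverSplits());
>>         }
>>
>>         @Override
>>         public void addReader(int subtaskId) {
>>             // Whatever the reason for the registration, just enforce the
>>             // assignment generated in start(), with no redistribution.
>>             context.assignSplits(new SplitsAssignment<>(
>>                 Map.of(subtaskId,
>>                        globalAssignment.getOrDefault(subtaskId, List.of()))));
>>         }
>>         // addSplitsBack(), snapshotState(), close(), etc. omitted here.
>>     }
>>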
>> Wouldn't that give the behavior you want? I feel the discussion somehow
>> goes in circles. Please let me know if you think a meeting would be more
>> efficient.
>>
>> Thanks,
>>
>> Jiangjie (Becket) Qin
>>
>> On Fri, Oct 10, 2025 at 7:58 PM Hongshun Wang <[email protected]>
>> wrote:
>>
>>> Hi Becket,
>>>
>>> > Ignore a returned split if it has been assigned to a different reader,
>>> otherwise put it back to unassigned splits / pending splits. Then the
>>> enumerator assigns new splits to the newly added reader, which may use the
>>> previous assignment as a reference. This should work regardless of whether
>>> it is a global failover, partial failover, restart, etc. There is no need
>>> for the SplitEnumerator to distinguish what failover scenario it is.
>>>
>>> In this case, it seems that global failover and partial failover share
>>> the same distribution strategy if a split has not been assigned to a
>>> different reader. However, splits need to be redistributed on global
>>> failover (this is why we need this FLIP), while on partial failover they do
>>> not. I have no idea how we distinguish them.
>>>
>>> What do you think?
>>>
>>> Best,
>>> Hongshun
>>>
>>> On Sat, Oct 11, 2025 at 12:54 AM Becket Qin <[email protected]>
>>> wrote:
>>>
>>>> Hi Hongshun,
>>>>
>>>> The problem we are trying to solve here is to give the splits back to
>>>> the SplitEnumerator. There are only two types of splits to give back:
>>>> 1) splits whose assignment has been checkpointed. - In this case, we
>>>> rely on addReader() + SplitEnumeratorContext to give the splits back, which
>>>> provides more information associated with those splits.
>>>> 2) splits whose assignment has not been checkpointed. - In this case,
>>>> we use addSplitsBack(); there is no reader info to give because the
>>>> previous assignment did not take effect to begin with.
>>>>
>>>> From the SplitEnumerator implementation perspective, the contract is
>>>> straightforward.
>>>> 1. The SplitEnumerator is the source of truth for assignment.
>>>> 2. When the enumerator receives the addSplitsBack() call, it always adds
>>>> these splits back to unassigned splits / pending splits.
>>>> 3. When the enumerator receives the addReader() call, that means the
>>>> reader has no current assignment, and has returned its previous assignment
>>>> based on the reader side info. The SplitEnumerator checks the
>>>> SplitEnumeratorContext to retrieve the returned splits from that reader
>>>> (i.e. the previous assignment) and handles them according to its own
>>>> source-of-truth knowledge of the assignment: ignore a returned split if it
>>>> has been assigned to a different reader, otherwise put it back to unassigned
>>>> splits / pending splits. Then the enumerator assigns new splits to the newly added
>>>> reader, which may use the previous assignment as a reference. This should
>>>> work regardless of whether it is a global failover, partial failover,
>>>> restart, etc. There is no need for the SplitEnumerator to distinguish what
>>>> failover scenario it is.
>>>>
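>>>> A rough sketch of that addReader() logic under this contract (the
>>>> accessor for the reported splits is a hypothetical addition, and the
>>>> maps are assumed enumerator state):
>>>>
>>>>     @Override
>>>>     public void addReader(int subtaskId) {
>>>>         // Hypothetical: splits the registering reader reported back.
>>>>         List<MySplit> returned =
>>>>             context.registeredReaders().get(subtaskId).getSplitsOnRecovery();
>>>>         for (MySplit split : returned) {
>>>>             Integer owner = assignedSplits.get(split.splitId()); // source of truth
>>>>             if (owner == null || owner == subtaskId) {
>>>>                 pendingSplits.add(split); // back to unassigned / pending
>>>>             } // else: already assigned to another reader, ignore it
>>>>         }
>>>>         assignPendingSplits(subtaskId); // may use the old assignment as a hint
>>>>     }
>>>>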
>>>> Would this work?
>>>>
>>>> Thanks,
>>>>
>>>> Jiangjie (Becket) Qin
>>>>
>>>> On Fri, Oct 10, 2025 at 1:28 AM Hongshun Wang <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Becket,
>>>>>  > why do we need to change the behavior of addSplitsBack()? Should it
>>>>> remain the same?
>>>>>
>>>>> How does the enumerator get the splits from the ReaderRegistrationEvent
>>>>> and then reassign them?
>>>>>
>>>>> You gave this advice before:
>>>>> > 1. Put all the reader information in the SplitEnumerator context.
>>>>> 2. notify the enumerator about the new reader registration. 3. let the
>>>>> split enumerator get whatever information it wants from the context and do
>>>>> its job.
>>>>>
>>>>> However, each time a source task fails over, the
>>>>> ConcurrentMap<Integer, ConcurrentMap<Integer, ReaderInfo>>
>>>>> registeredReaders will remove this reader's info. When the source task is
>>>>> registered again, the info will be added again. *Thus, registeredReaders
>>>>> cannot know whether the reader was registered before.*
>>>>>
>>>>> Therefore, enumerator#addReader does not distinguish the following
>>>>> three situations:
>>>>> 1. The reader is registered during a global restart. In this case,
>>>>> redistribute the splits from the infos (take all the splits off the
>>>>> ReaderInfo).
>>>>> 2. The reader is registered during a partial failover (before the first
>>>>> successful checkpoint). In this case, ignore the splits from the infos
>>>>> (leave all the splits in the ReaderInfo alone).
>>>>> 3. The reader is registered during a partial failover (after the first
>>>>> successful checkpoint). In this case, we need to assign the splits to the
>>>>> same reader again (take the splits off the ReaderInfo but assign them to
>>>>> it again).
>>>>> We still need the enumerator to distinguish them (using
>>>>> pendingSplitAssignments & assignedSplitAssignments). However, it is
>>>>> redundant to maintain split assignment information both in the enumerator
>>>>> and the enumerator context.
>>>>>
>>>>> I think if we change the behavior of addSplitsBack, it will be
>>>>> simpler. Just let the enumerator handle these splits based on
>>>>> pendingSplitAssignments & assignedSplitAssignments.
>>>>>
>>>>> What do you think?
>>>>>
>>>>> Best,
>>>>> Hongshun
>>>>>
>>>>> On Fri, Oct 10, 2025 at 12:55 PM Becket Qin <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi Hongshun,
>>>>>>
>>>>>> Thanks for updating the FLIP. A quick question: why do we need to
>>>>>> change the behavior of addSplitsBack()? Should it remain the same?
>>>>>>
>>>>>> Regarding the case of restart with changed subscription. I think the
>>>>>> only correct behavior is removing obsolete splits without any warning /
>>>>>> exception. It is OK to add an info level logging if we want to. It is a
>>>>>> clear intention if the user has explicitly changed subscription and
>>>>>> restarted the job. There is no need to add a config to double confirm.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Jiangjie (Becket) Qin
>>>>>>
>>>>>> On Thu, Oct 9, 2025 at 7:28 PM Hongshun Wang <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Leonard,
>>>>>>>
>>>>>>> If the SplitEnumerator receives all splits after a restart, it
>>>>>>> becomes straightforward to clear and un-assign the unmatched
>>>>>>> splits (by checking whether each split matches the source options).
>>>>>>> However, a key question arises: *should we automatically discard obsolete
>>>>>>> splits, or explicitly notify the user via an exception?*
>>>>>>>
>>>>>>> We provide an option `scan.partition-unsubscribe.strategy`:
>>>>>>> 1. If Strict, throw an exception when encountering removed splits.
>>>>>>> 2. If Lenient, automatically remove obsolete splits silently.
>>>>>>>
>>>>>>> What do you think?
>>>>>>>
>>>>>>> Best,
>>>>>>> Hongshun
>>>>>>>
>>>>>>> On Thu, Oct 9, 2025 at 9:37 PM Leonard Xu <[email protected]> wrote:
>>>>>>>
>>>>>>>> Thanks Hongshun for the update and the pretty detailed analysis of
>>>>>>>> edge cases; the updated FLIP looks good to me now.
>>>>>>>>
>>>>>>>> Only one last implementation detail about the scenario in the
>>>>>>>> motivation section:
>>>>>>>>
>>>>>>>> *Restart with changed subscription: During restart, if the source
>>>>>>>> options remove a topic or table, the splits which have already been
>>>>>>>> assigned cannot be removed.*
>>>>>>>>
>>>>>>>> Could you clarify how we resolve this in the Kafka connector?
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Leonard
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Oct 9, 2025, at 19:48, Hongshun Wang <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Hi devs,
>>>>>>>> If there are no further suggestions, I will start the voting
>>>>>>>> tomorrow.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Hongshun
>>>>>>>>
>>>>>>>> On Fri, Sep 26, 2025 at 7:48 PM Hongshun Wang <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Becket and Leonard,
>>>>>>>>>
>>>>>>>>> I have updated the content of this FLIP. The key point is that:
>>>>>>>>>
>>>>>>>>> When the split enumerator receives a split, *it must have already
>>>>>>>>> existed in pendingSplitAssignments or assignedSplitAssignments*.
>>>>>>>>>
>>>>>>>>>    - If the split is in pendingSplitAssignments, ignore it.
>>>>>>>>>    - If the split is in assignedSplitAssignments but has a
>>>>>>>>>    different taskId, ignore it (this indicates it was already
>>>>>>>>>    assigned to another task).
>>>>>>>>>    - If the split is in assignedSplitAssignments and shares the
>>>>>>>>>    same taskId, move the assignment from assignedSplitAssignments to
>>>>>>>>>    pendingSplitAssignments to re-assign it (see the sketch below).
>>>>>>>>>
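>>>>>>>>> As a sketch, the three rules could look like this (the map names are
>>>>>>>>> the ones used above, the rest is assumed):
>>>>>>>>>
>>>>>>>>>     void onSplitReturned(MySplit split, int taskId) {
>>>>>>>>>         String id = split.splitId();
>>>>>>>>>         if (pendingSplitAssignments.containsKey(id)) {
>>>>>>>>>             return; // rule 1: already pending, ignore
>>>>>>>>>         }
>>>>>>>>>         Integer assignedTask = assignedSplitAssignments.get(id);
>>>>>>>>>         if (assignedTask == null || assignedTask != taskId) {
>>>>>>>>>             return; // rule 2: owned by another task, ignore
>>>>>>>>>         }
>>>>>>>>>         // rule 3: same task, move back to pending for re-assignment
>>>>>>>>>         assignedSplitAssignments.remove(id);
>>>>>>>>>         pendingSplitAssignments.put(id, split);
>>>>>>>>>     }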
>>>>>>>>>
>>>>>>>>> To better explain why these strategies are used, I added some
>>>>>>>>> examples and pictures to show it.
>>>>>>>>>
>>>>>>>>> Would you like to help me check whether there are still some
>>>>>>>>> problems?
>>>>>>>>>
>>>>>>>>> Best
>>>>>>>>> Hongshun
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Sep 26, 2025 at 5:08 PM Leonard Xu <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks Becket and Hongshun for the insightful discussion.
>>>>>>>>>>
>>>>>>>>>> The underlying implementation and communication mechanisms of
>>>>>>>>>> Flink Source indeed involve many intricate details, we discussed the 
>>>>>>>>>> issue
>>>>>>>>>> of splits re-assignment in specific scenarios, but fortunately, the 
>>>>>>>>>> final
>>>>>>>>>> decision turned out to be pretty clear.
>>>>>>>>>>
>>>>>>>>>>  +1 to Becket’s proposal, which keeps the framework cleaner and
>>>>>>>>>> more flexible.
>>>>>>>>>> +1 to Hongshun’s point to provide comprehensive guidance for
>>>>>>>>>> connector developers.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Leonard
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sep 26, 2025, at 16:30, Hongshun Wang <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Becket,
>>>>>>>>>>
>>>>>>>>>> I got it. You’re suggesting we should not handle this in the
>>>>>>>>>> source framework but instead let the split enumerator manage these 
>>>>>>>>>> three
>>>>>>>>>> scenarios.
>>>>>>>>>>
>>>>>>>>>> Let me explain why I originally favored handling it in the
>>>>>>>>>> framework: I'm concerned that connector developers might overlook
>>>>>>>>>> certain edge cases (after all, we even needed extensive discussions
>>>>>>>>>> to fully clarify the logic).
>>>>>>>>>>
>>>>>>>>>> However, your approach keeps the framework cleaner and more
>>>>>>>>>> flexible. Thus, I will take it.
>>>>>>>>>>
>>>>>>>>>> Perhaps, in this FLIP, we should focus on providing
>>>>>>>>>> comprehensive guidance for connector developers: explain how to
>>>>>>>>>> implement a split enumerator, including the underlying challenges 
>>>>>>>>>> and their
>>>>>>>>>> solutions.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Additionally, we can use the Kafka connector as a reference
>>>>>>>>>> implementation to demonstrate the practical steps. This way, 
>>>>>>>>>> developers who
>>>>>>>>>> want to implement similar connectors can directly reference this 
>>>>>>>>>> example.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Hongshun
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Sep 26, 2025 at 1:27 PM Becket Qin <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> It would be good to not expose runtime details to the source
>>>>>>>>>>> implementation if possible.
>>>>>>>>>>>
>>>>>>>>>>> Today, the split enumerator implementations are expected to
>>>>>>>>>>> track the split assignment.
>>>>>>>>>>>
>>>>>>>>>>> Assuming the split enumerator implementation keeps a split
>>>>>>>>>>> assignment map, that means the enumerator should already know 
>>>>>>>>>>> whether a
>>>>>>>>>>> split is assigned or unassigned. So it can handle the three 
>>>>>>>>>>> scenarios you
>>>>>>>>>>> mentioned.
>>>>>>>>>>>
>>>>>>>>>>> The split is reported by a reader during a global restoration.
>>>>>>>>>>>>
>>>>>>>>>>> The split enumerator should have just been restored / created.
>>>>>>>>>>> If the enumerator expects a full reassignment of splits up on global
>>>>>>>>>>> recovery, there should be no assigned splits to that reader in the 
>>>>>>>>>>> split
>>>>>>>>>>> assignment mapping.
>>>>>>>>>>>
>>>>>>>>>>> The split is reported by a reader during a partial failure
>>>>>>>>>>>> recovery.
>>>>>>>>>>>>
>>>>>>>>>>> In this case, when SplitEnumerator.addReader() is invoked, the
>>>>>>>>>>> split assignment map in the enumerator implementation should 
>>>>>>>>>>> already have
>>>>>>>>>>> some split assignments for the reader. Therefore it is a partial 
>>>>>>>>>>> failover.
>>>>>>>>>>> If the source supports split reassignment on recovery, the 
>>>>>>>>>>> enumerator can
>>>>>>>>>>> assign splits that are different from the reported assignment of 
>>>>>>>>>>> that
>>>>>>>>>>> reader in the SplitEnumeratorContext, or it can also assign the same
>>>>>>>>>>> splits. In any case, the enumerator knows that this is a partial 
>>>>>>>>>>> recovery
>>>>>>>>>>> because the assignment map is non-empty.
>>>>>>>>>>>
>>>>>>>>>>> The split is not reported by a reader, but is assigned after the
>>>>>>>>>>>> last successful checkpoint and was never acknowledged.
>>>>>>>>>>>
>>>>>>>>>>> This is actually one of the steps in the partial failure recovery.
>>>>>>>>>>> SplitEnumerator.addSplitsBack() will be called before
>>>>>>>>>>> SplitEnumerator.addReader() is called for the recovered reader. When the
>>>>>>>>>>> SplitEnumerator.addSplitsBack() is invoked, it is for sure a partial
>>>>>>>>>>> recovery. And the enumerator should remove these splits from the 
>>>>>>>>>>> split
>>>>>>>>>>> assignment map as if they were never assigned.
>>>>>>>>>>>
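>>>>>>>>>>> In code, that addSplitsBack() behavior is just (a sketch, with the
>>>>>>>>>>> same assumed assignment maps as above):
>>>>>>>>>>>
>>>>>>>>>>>     @Override
>>>>>>>>>>>     public void addSplitsBack(List<MySplit> splits, int subtaskId) {
>>>>>>>>>>>         for (MySplit s : splits) {
>>>>>>>>>>>             // Forget the un-checkpointed assignment entirely...
>>>>>>>>>>>             assignedSplits.remove(s.splitId());
>>>>>>>>>>>             // ...and treat the split as never assigned.
>>>>>>>>>>>             pendingSplits.add(s);
>>>>>>>>>>>         }
>>>>>>>>>>>     }
>>>>>>>>>>>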
>>>>>>>>>>> I think this should work, right?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Sep 25, 2025 at 8:34 PM Hongshun Wang <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Becket and Leonard,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for your advice.
>>>>>>>>>>>>
>>>>>>>>>>>> > put all the reader information in the SplitEnumerator context
>>>>>>>>>>>> I have a concern: the current registeredReaders in*
>>>>>>>>>>>> SourceCoordinatorContext will be removed after subtaskResetor 
>>>>>>>>>>>> execution on
>>>>>>>>>>>> failure*.However, this approach has merit.
>>>>>>>>>>>>
>>>>>>>>>>>> One more situation I found my previous design does not cover:
>>>>>>>>>>>>
>>>>>>>>>>>>    1. Initial state: Reader A reports splits (1, 2).
>>>>>>>>>>>>    2. Enumerator action: Assigns split 1 to Reader A, and
>>>>>>>>>>>>    split 2 to Reader B.
>>>>>>>>>>>>    3. Failure scenario: Reader A fails before checkpointing.
>>>>>>>>>>>>    Since this is a partial failure, only Reader A restarts.
>>>>>>>>>>>>    4. Recovery issue: Upon recovery, Reader A re-reports split
>>>>>>>>>>>>    (1).
>>>>>>>>>>>>
>>>>>>>>>>>> In my previous design, the enumerator would ignore Reader A's
>>>>>>>>>>>> re-registration, which would cause data loss.
>>>>>>>>>>>>
>>>>>>>>>>>> Thus, when the enumerator receives a split, the split may
>>>>>>>>>>>> originate from three scenarios:
>>>>>>>>>>>>
>>>>>>>>>>>>    1. The split is reported by a reader during a global
>>>>>>>>>>>>    restoration.
>>>>>>>>>>>>    2. The split is reported by a reader during a partial
>>>>>>>>>>>>    failure recovery.
>>>>>>>>>>>>    3. The split is not reported by a reader, but is assigned
>>>>>>>>>>>>    after the last successful checkpoint and was never acknowledged.
>>>>>>>>>>>>
>>>>>>>>>>>> In the first scenario (global restore), the split should
>>>>>>>>>>>> be re-distributed. For the latter two scenarios (partial failover 
>>>>>>>>>>>> and
>>>>>>>>>>>> post-checkpoint assignment), we need to reassign the split to
>>>>>>>>>>>> its originally assigned subtask.
>>>>>>>>>>>>
>>>>>>>>>>>> By implementing a method in the SplitEnumerator context to
>>>>>>>>>>>> track each assigned split's status, the system can correctly
>>>>>>>>>>>> identify and resolve split ownership in all three scenarios. *What
>>>>>>>>>>>> about adding a `SplitRecoveryType splitRecoveryType(Split split)`
>>>>>>>>>>>> method to SplitEnumeratorContext?* SplitRecoveryType is an enum
>>>>>>>>>>>> including `UNASSIGNED`, `GLOBAL_RESTORE`, `PARTIAL_FAILOVER` and
>>>>>>>>>>>> `POST_CHECKPOINT_ASSIGNMENT` (see the sketch below).
>>>>>>>>>>>>
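>>>>>>>>>>>> In code form, the proposed addition would be roughly (a sketch of
>>>>>>>>>>>> the proposal above, not an existing API):
>>>>>>>>>>>>
>>>>>>>>>>>>     public enum SplitRecoveryType {
>>>>>>>>>>>>         UNASSIGNED,
>>>>>>>>>>>>         GLOBAL_RESTORE,
>>>>>>>>>>>>         PARTIAL_FAILOVER,
>>>>>>>>>>>>         POST_CHECKPOINT_ASSIGNMENT
>>>>>>>>>>>>     }
>>>>>>>>>>>>
>>>>>>>>>>>>     public interface SplitEnumeratorContext<SplitT extends SourceSplit> {
>>>>>>>>>>>>         // ... existing methods unchanged ...
>>>>>>>>>>>>         SplitRecoveryType splitRecoveryType(SplitT split);
>>>>>>>>>>>>     }
>>>>>>>>>>>>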
>>>>>>>>>>>> What do you think? Are there any details or scenarios I haven't
>>>>>>>>>>>> considered? Looking forward to your advice.
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Hongshun
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Sep 11, 2025 at 12:41 AM Becket Qin <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for the explanation, Hongshun.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The current pattern of handling new reader registration is the following:
>>>>>>>>>>>>> 1. put all the reader information in the SplitEnumerator
>>>>>>>>>>>>> context
>>>>>>>>>>>>> 2. notify the enumerator about the new reader registration.
>>>>>>>>>>>>> 3. Let the split enumerator get whatever information it wants
>>>>>>>>>>>>> from the
>>>>>>>>>>>>> context and do its job.
>>>>>>>>>>>>> This pattern decouples the information passing and the reader
>>>>>>>>>>>>> registration
>>>>>>>>>>>>> notification. This makes the API extensible - we can add more
>>>>>>>>>>>>> information
>>>>>>>>>>>>> (e.g. reported assigned splits in our case) about the reader
>>>>>>>>>>>>> to the context
>>>>>>>>>>>>> without introducing new methods.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Introducing a new method addSplitsBackOnRecovery() is redundant with
>>>>>>>>>>>>> the above pattern. Do we really need it?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Sep 8, 2025 at 8:18 PM Hongshun Wang <
>>>>>>>>>>>>> [email protected]>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> > Hi Becket,
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > > I am curious what would the enumerator do differently for
>>>>>>>>>>>>> the splits
>>>>>>>>>>>>> > added via addSplitsBackOnRecovery() V.S. addSplitsBack()?
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > In this FLIP, there are two distinct scenarios in which the
>>>>>>>>>>>>> > enumerator receives splits being added back:
>>>>>>>>>>>>> > 1. Job-level restore: the job is restored, and the splits from the
>>>>>>>>>>>>> > reader's state are reported via ReaderRegistrationEvent.
>>>>>>>>>>>>> > 2. Reader-level restart: a reader is restarted, but not the whole job;
>>>>>>>>>>>>> > it gets back the splits assigned to it after the last successful
>>>>>>>>>>>>> > checkpoint. This is what addSplitsBack used to do.
>>>>>>>>>>>>> >
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > In these two situations, the enumerator will choose
>>>>>>>>>>>>> different strategies.
>>>>>>>>>>>>> > 1. Job-level restore: the splits should be redistributed
>>>>>>>>>>>>> across readers
>>>>>>>>>>>>> > according to the current partitioner strategy.
>>>>>>>>>>>>> > 2. Reader-level restart: the splits should be reassigned
>>>>>>>>>>>>> directly back to
>>>>>>>>>>>>> > the same reader they were originally assigned to, preserving
>>>>>>>>>>>>> locality and
>>>>>>>>>>>>> > avoiding unnecessary redistribution
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > Therefore, the enumerator must clearly distinguish between these two
>>>>>>>>>>>>> > scenarios. I originally proposed deprecating the former
>>>>>>>>>>>>> > addSplitsBack(List<SplitT> splits, int subtaskId) and adding a new
>>>>>>>>>>>>> > addSplitsBack(List<SplitT> splits, int subtaskId, boolean reportedByReader).
>>>>>>>>>>>>> > Leonard suggested using another method, addSplitsBackOnRecovery, that does
>>>>>>>>>>>>> > not influence the current addSplitsBack.
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > Best
>>>>>>>>>>>>> > Hongshun
>>>>>>>>>>>>> >
>>>>>>>>>>>>> >
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > On 2025/09/08 17:20:31 Becket Qin wrote:
>>>>>>>>>>>>> > > Hi Leonard,
>>>>>>>>>>>>> > >
>>>>>>>>>>>>> > >
>>>>>>>>>>>>> > > > Could we introduce a new method like
>>>>>>>>>>>>> addSplitsBackOnRecovery  with
>>>>>>>>>>>>> > default
>>>>>>>>>>>>> > > > implementation. In this way, we can provide better
>>>>>>>>>>>>> backward
>>>>>>>>>>>>> > compatibility
>>>>>>>>>>>>> > > > and also makes it easier for developers to understand.
>>>>>>>>>>>>> > >
>>>>>>>>>>>>> > >
>>>>>>>>>>>>> > > I am curious what would the enumerator do differently for
>>>>>>>>>>>>> the splits
>>>>>>>>>>>>> > added
>>>>>>>>>>>>> > > via addSplitsBackOnRecovery() V.S. addSplitsBack()?  Today,
>>>>>>>>>>>>> > addSplitsBack()
>>>>>>>>>>>>> > > is also only called upon recovery. So the new method seems
>>>>>>>>>>>>> confusing. One
>>>>>>>>>>>>> > > thing worth clarifying is if the Source implements
>>>>>>>>>>>>> > > SupportSplitReassignmentOnRecovery, upon recovery, should
>>>>>>>>>>>>> the splits
>>>>>>>>>>>>> > > reported by the readers also be added back to the
>>>>>>>>>>>>> SplitEnumerator via the
>>>>>>>>>>>>> > > addSplitsBack() call? Or should the SplitEnumerator
>>>>>>>>>>>>> explicitly query the
>>>>>>>>>>>>> > > registered reader information via the
>>>>>>>>>>>>> SplitEnumeratorContext to get the
>>>>>>>>>>>>> > > originally assigned splits when addReader() is invoked? I
>>>>>>>>>>>>> was assuming
>>>>>>>>>>>>> > the
>>>>>>>>>>>>> > > latter in the beginning, so the behavior of
>>>>>>>>>>>>> addSplitsBack() remains
>>>>>>>>>>>>> > > unchanged, but I am not opposed in doing the former.
>>>>>>>>>>>>> > >
>>>>>>>>>>>>> > > Also, can you elaborate on the backwards compatibility
>>>>>>>>>>>>> issue you see if
>>>>>>>>>>>>> > we
>>>>>>>>>>>>> > > do not have a separate addSplitsBackOnRecovery() method?
>>>>>>>>>>>>> Even without
>>>>>>>>>>>>> > this
>>>>>>>>>>>>> > > new method, the behavior remains exactly the same unless
>>>>>>>>>>>>> the end users
>>>>>>>>>>>>> > > implement the mix-in interface of
>>>>>>>>>>>>> "SupportSplitReassignmentOnRecovery",
>>>>>>>>>>>>> > > right?
>>>>>>>>>>>>> > >
>>>>>>>>>>>>> > > Thanks,
>>>>>>>>>>>>> > >
>>>>>>>>>>>>> > > Jiangjie (Becket) Qin
>>>>>>>>>>>>> > >
>>>>>>>>>>>>> > > On Mon, Sep 8, 2025 at 1:48 AM Hongshun Wang <
>>>>>>>>>>>>> [email protected]>
>>>>>>>>>>>>> > > wrote:
>>>>>>>>>>>>> > >
>>>>>>>>>>>>> > > > Hi devs,
>>>>>>>>>>>>> > > >
>>>>>>>>>>>>> > > > It has been quite some time since this FLIP[1] was first
>>>>>>>>>>>>> proposed.
>>>>>>>>>>>>> > Thank
>>>>>>>>>>>>> > > > you for your valuable feedback—based on your
>>>>>>>>>>>>> suggestions, the FLIP has
>>>>>>>>>>>>> > > > undergone several rounds of revisions.
>>>>>>>>>>>>> > > >
>>>>>>>>>>>>> > > > Any more advice is welcome and appreciated. If there are
>>>>>>>>>>>>> no further
>>>>>>>>>>>>> > > > concerns, I plan to start the vote tomorrow.
>>>>>>>>>>>>> > > >
>>>>>>>>>>>>> > > > Best
>>>>>>>>>>>>> > > > Hongshun
>>>>>>>>>>>>> > > >
>>>>>>>>>>>>> > > > [1]
>>>>>>>>>>>>> > > >
>>>>>>>>>>>>> >
>>>>>>>>>>>>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=373886480
>>>>>>>>>>>>> > > >
>>>>>>>>>>>>> > > > On Mon, Sep 8, 2025 at 4:42 PM Hongshun Wang <
>>>>>>>>>>>>> [email protected]>
>>>>>>>>>>>>> > > > wrote:
>>>>>>>>>>>>> > > >
>>>>>>>>>>>>> > > > > Hi Leonard,
>>>>>>>>>>>>> > > > > Thanks for your advice.  It makes sense and I have
>>>>>>>>>>>>> modified it.
>>>>>>>>>>>>> > > > >
>>>>>>>>>>>>> > > > > Best,
>>>>>>>>>>>>> > > > > Hongshun
>>>>>>>>>>>>> > > > >
>>>>>>>>>>>>> > > > > On Mon, Sep 8, 2025 at 11:40 AM Leonard Xu <
>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>> > > > >
>>>>>>>>>>>>> > > > >> Thanks Hongshun and Becket for the deep discussion.
>>>>>>>>>>>>> > > > >>
>>>>>>>>>>>>> > > > >> I only have one comment for one API design:
>>>>>>>>>>>>> > > > >>
>>>>>>>>>>>>> > > > >> > Deprecate the old addSplitsBack method; use an addSplitsBack with a
>>>>>>>>>>>>> > > > >> param isReportedByReader instead, because the enumerator can apply
>>>>>>>>>>>>> > > > >> different reassignment policies based on the context.
>>>>>>>>>>>>> > > > >>
>>>>>>>>>>>>> > > > >> Could we introduce a new method like
>>>>>>>>>>>>> *addSplitsBackOnRecovery*  with
>>>>>>>>>>>>> > > > default
>>>>>>>>>>>>> > > > >> implementation. In this way, we can provide better
>>>>>>>>>>>>> backward
>>>>>>>>>>>>> > > > >> compatibility and also makes it easier for developers
>>>>>>>>>>>>> to understand.
>>>>>>>>>>>>> > > > >>
>>>>>>>>>>>>> > > > >> Best,
>>>>>>>>>>>>> > > > >> Leonard
>>>>>>>>>>>>> > > > >>
>>>>>>>>>>>>> > > > >>
>>>>>>>>>>>>> > > > >>
>>>>>>>>>>>>> > > > >> On Sep 3, 2025, at 20:26, Hongshun Wang <[email protected]> wrote:
>>>>>>>>>>>>> > > > >>
>>>>>>>>>>>>> > > > >> Hi Becket,
>>>>>>>>>>>>> > > > >>
>>>>>>>>>>>>> > > > >> I think that's a great idea! I have added the
>>>>>>>>>>>>> > > > >> SupportSplitReassignmentOnRecovery interface in this FLIP. If a Source
>>>>>>>>>>>>> > > > >> implements this interface, it indicates that the source operator needs
>>>>>>>>>>>>> > > > >> to report splits to the enumerator and receive reassignment. [1]
>>>>>>>>>>>>> > > > >>
>>>>>>>>>>>>> > > > >> Best,
>>>>>>>>>>>>> > > > >> Hongshun
>>>>>>>>>>>>> > > > >>
>>>>>>>>>>>>> > > > >> [1]
>>>>>>>>>>>>> > > > >>
>>>>>>>>>>>>> > > >
>>>>>>>>>>>>> >
>>>>>>>>>>>>> >
>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-537%3A+Enumerator+with+Global+Split+Assignment+Distribution+for+Balanced+Split+assignment
>>>>>>>>>>>>> > > > >>
>>>>>>>>>>>>> > > > >> On Thu, Aug 21, 2025 at 12:09 PM Becket Qin <
>>>>>>>>>>>>> [email protected]>
>>>>>>>>>>>>> > > > wrote:
>>>>>>>>>>>>> > > > >>
>>>>>>>>>>>>> > > > >>> Hi Hongshun,
>>>>>>>>>>>>> > > > >>>
>>>>>>>>>>>>> > > > >>> I think the convention for such optional features in
>>>>>>>>>>>>> Source is via
>>>>>>>>>>>>> > > > >>> mix-in interfaces. So instead of adding a method to
>>>>>>>>>>>>> the
>>>>>>>>>>>>> > SourceReader,
>>>>>>>>>>>>> > > > maybe
>>>>>>>>>>>>> > > > >>> we should introduce an interface
>>>>>>>>>>>>> SupportSplitReassignmentOnRecovery
>>>>>>>>>>>>> > > > with
>>>>>>>>>>>>> > > > >>> this method. If a Source implementation implements
>>>>>>>>>>>>> that interface,
>>>>>>>>>>>>> > > > then the
>>>>>>>>>>>>> > > > >>> SourceOperator will check the desired behavior and
>>>>>>>>>>>>> act accordingly.
>>>>>>>>>>>>> > > > >>>
>>>>>>>>>>>>> > > > >>> Thanks,
>>>>>>>>>>>>> > > > >>>
>>>>>>>>>>>>> > > > >>> Jiangjie (Becket) Qin
>>>>>>>>>>>>> > > > >>>
>>>>>>>>>>>>> > > > >>> On Wed, Aug 20, 2025 at 8:52 PM Hongshun Wang <
>>>>>>>>>>>>> > [email protected]
>>>>>>>>>>>>> > > > >
>>>>>>>>>>>>> > > > >>> wrote:
>>>>>>>>>>>>> > > > >>>
>>>>>>>>>>>>> > > > >>>> Hi devs,
>>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>>> > > > >>>> Would anyone like to discuss this FLIP? I'd
>>>>>>>>>>>>> appreciate your
>>>>>>>>>>>>> > feedback
>>>>>>>>>>>>> > > > >>>> and suggestions.
>>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>>> > > > >>>> Best,
>>>>>>>>>>>>> > > > >>>> Hongshun
>>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>>> > > > >>>> On Aug 13, 2025, at 14:23, Hongshun Wang <[email protected]> wrote:
>>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>>> > > > >>>> Hi Becket,
>>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>>> > > > >>>> Thank you for your detailed feedback. The new
>>>>>>>>>>>>> contract makes good
>>>>>>>>>>>>> > > > sense
>>>>>>>>>>>>> > > > >>>> to me and effectively addresses the issues I
>>>>>>>>>>>>> encountered at the
>>>>>>>>>>>>> > > > beginning
>>>>>>>>>>>>> > > > >>>> of the design.
>>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>>> > > > >>>> That said, I recommend not reporting splits by
>>>>>>>>>>>>> default, primarily
>>>>>>>>>>>>> > for
>>>>>>>>>>>>> > > > >>>> compatibility and practical reasons:
>>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>>> > > > >>>> >  For these reasons, we do not expect the Split
>>>>>>>>>>>>> objects to be
>>>>>>>>>>>>> > huge,
>>>>>>>>>>>>> > > > >>>> and we are not trying to design for huge Split
>>>>>>>>>>>>> objects either as
>>>>>>>>>>>>> > they
>>>>>>>>>>>>> > > > will
>>>>>>>>>>>>> > > > >>>> have problems even today.
>>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>>> > > > >>>>    1.
>>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>>> > > > >>>>    Not all existing connectors match this rule.
>>>>>>>>>>>>> > > > >>>>    For example, in the MySQL CDC connector, a binlog
>>>>>>>>>>>>> split may contain
>>>>>>>>>>>>> > > > >>>>    hundreds (or even more) snapshot split
>>>>>>>>>>>>> completion records. This
>>>>>>>>>>>>> > > > state is
>>>>>>>>>>>>> > > > >>>>    large and is currently transmitted incrementally
>>>>>>>>>>>>> through
>>>>>>>>>>>>> > multiple
>>>>>>>>>>>>> > > > >>>>    BinlogSplitMetaEvent messages. Since the binlog
>>>>>>>>>>>>> reader operates
>>>>>>>>>>>>> > > > >>>>    with single parallelism, reporting the full
>>>>>>>>>>>>> split state on
>>>>>>>>>>>>> > recovery
>>>>>>>>>>>>> > > > >>>>    could be inefficient or even infeasible.
>>>>>>>>>>>>> > > > >>>>    For such sources, it would be better to provide
>>>>>>>>>>>>> a mechanism to
>>>>>>>>>>>>> > skip
>>>>>>>>>>>>> > > > >>>>    split reporting during restart until they
>>>>>>>>>>>>> redesign and reduce
>>>>>>>>>>>>> > the
>>>>>>>>>>>>> > > > >>>>    split size.
>>>>>>>>>>>>> > > > >>>>    2.
>>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>>> > > > >>>>    Not all enumerators maintain unassigned splits
>>>>>>>>>>>>> in state.
>>>>>>>>>>>>> > > > >>>>    Some SplitEnumerator implementations (such as the Kafka
>>>>>>>>>>>>> > > > >>>>    connector's) do not track or persistently manage unassigned
>>>>>>>>>>>>> > > > >>>>    splits. Requiring them to handle re-registration would add
>>>>>>>>>>>>> > > > >>>>    unnecessary complexity. Even if we implement it in the Kafka
>>>>>>>>>>>>> > > > >>>>    connector, the Kafka connector is currently decoupled from the
>>>>>>>>>>>>> > > > >>>>    Flink version, so we also need to make sure the older versions
>>>>>>>>>>>>> > > > >>>>    stay compatible.
>>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>>> > > > >>>> ------------------------------
>>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>>> > > > >>>> To address these concerns, I propose introducing a
>>>>>>>>>>>>> new method:
>>>>>>>>>>>>> > boolean
>>>>>>>>>>>>> > > > >>>> SourceReader#shouldReassignSplitsOnRecovery() with
>>>>>>>>>>>>> a default
>>>>>>>>>>>>> > > > >>>> implementation returning false. This allows source
>>>>>>>>>>>>> readers to opt
>>>>>>>>>>>>> > in
>>>>>>>>>>>>> > > > >>>> to split reassignment only when necessary. Since
>>>>>>>>>>>>> the new contract
>>>>>>>>>>>>> > > > already
>>>>>>>>>>>>> > > > >>>> places the responsibility for split assignment on
>>>>>>>>>>>>> the enumerator,
>>>>>>>>>>>>> > not
>>>>>>>>>>>>> > > > >>>> reporting splits by default is a safe and clean
>>>>>>>>>>>>> default behavior.
>>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>>> > > > >>>> ------------------------------
>>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>>> > > > >>>> I’ve updated the implementation and the FLIP
>>>>>>>>>>>>> > > > >>>> accordingly [1]. It is quite a big change.
>>>>>>>>>>>>> we can now use
>>>>>>>>>>>>> > a
>>>>>>>>>>>>> > > > >>>> pluggable SplitPartitioner to support different
>>>>>>>>>>>>> split assignment
>>>>>>>>>>>>> > > > >>>> strategies (e.g., default, round-robin).
>>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>>> > > > >>>> Could you please review it when you have a chance?
>>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>>> > > > >>>> Best,
>>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>>> > > > >>>> Hongshun
>>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>>> > > > >>>> [1]
>>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>>> > > >
>>>>>>>>>>>>> >
>>>>>>>>>>>>> >
>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-537%3A+Enumerator+with+Global+Split+Assignment+Distribution+for+Balanced+Split+assignment
>>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>>> > > > >>>> On Sat, Aug 9, 2025 at 3:03 AM Becket Qin <
>>>>>>>>>>>>> [email protected]>
>>>>>>>>>>>>> > > > wrote:
>>>>>>>>>>>>> > > > >>>>
>>>>>>>>>>>>> > > > >>>>> Hi Hongshun,
>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>> > > > >>>>> I am not too concerned about the transmission
>>>>>>>>>>>>> cost, because the
>>>>>>>>>>>>> > full
>>>>>>>>>>>>> > > > >>>>> split transmission has to happen in the initial
>>>>>>>>>>>>> assignment phase
>>>>>>>>>>>>> > > > already.
>>>>>>>>>>>>> > > > >>>>> And in the future, we probably want to also
>>>>>>>>>>>>> introduce some kind
>>>>>>>>>>>>> > of
>>>>>>>>>>>>> > > > workload
>>>>>>>>>>>>> > > > >>>>> balance across source readers, e.g. based on the
>>>>>>>>>>>>> per-split
>>>>>>>>>>>>> > > > throughput or
>>>>>>>>>>>>> > > > >>>>> the per-source-reader workload in heterogeneous
>>>>>>>>>>>>> clusters. For
>>>>>>>>>>>>> > these
>>>>>>>>>>>>> > > > >>>>> reasons, we do not expect the Split objects to be
>>>>>>>>>>>>> huge, and we
>>>>>>>>>>>>> > are
>>>>>>>>>>>>> > > > not
>>>>>>>>>>>>> > > > >>>>> trying to design for huge Split objects either as
>>>>>>>>>>>>> they will have
>>>>>>>>>>>>> > > > problems
>>>>>>>>>>>>> > > > >>>>> even today.
>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>> > > > >>>>> Good point on the potential split loss, please see
>>>>>>>>>>>>> the reply
>>>>>>>>>>>>> > below:
>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>> > > > >>>>> Scenario 2:
>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>> > > > >>>>>> 1. Reader A reports splits (1 and 2), and Reader
>>>>>>>>>>>>> B reports (3
>>>>>>>>>>>>> > and 4)
>>>>>>>>>>>>> > > > >>>>>> upon restart.
>>>>>>>>>>>>> > > > >>>>>> 2. Before the enumerator receives all reports and
>>>>>>>>>>>>> performs
>>>>>>>>>>>>> > > > >>>>>> reassignment, a checkpoint is triggered.
>>>>>>>>>>>>> > > > >>>>>> 3. Since no splits have been reassigned yet, both
>>>>>>>>>>>>> readers have
>>>>>>>>>>>>> > empty
>>>>>>>>>>>>> > > > >>>>>> states.
>>>>>>>>>>>>> > > > >>>>>> 4. When restarting from this checkpoint, all four
>>>>>>>>>>>>> splits are
>>>>>>>>>>>>> > lost.
>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>> > > > >>>>> The reader registration happens in the
>>>>>>>>>>>>> SourceOperator.open(),
>>>>>>>>>>>>> > which
>>>>>>>>>>>>> > > > >>>>> means the task is still in the initializing state,
>>>>>>>>>>>>> therefore the
>>>>>>>>>>>>> > > > checkpoint
>>>>>>>>>>>>> > > > >>>>> should not be triggered until the enumerator
>>>>>>>>>>>>> receives all the
>>>>>>>>>>>>> > split
>>>>>>>>>>>>> > > > reports.
>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>> > > > >>>>> There is a nuance here. Today, the RPC call from
>>>>>>>>>>>>> the TM to the JM
>>>>>>>>>>>>> > is
>>>>>>>>>>>>> > > > >>>>> async. So it is possible that the
>>>>>>>>>>>>> SourceOperator.open() has
>>>>>>>>>>>>> > returned,
>>>>>>>>>>>>> > > > but
>>>>>>>>>>>>> > > > >>>>> the enumerator has not received the split reports.
>>>>>>>>>>>>> However,
>>>>>>>>>>>>> > because
>>>>>>>>>>>>> > > > the
>>>>>>>>>>>>> > > > >>>>> task status update RPC call goes to the same
>>>>>>>>>>>>> channel as the split
>>>>>>>>>>>>> > > > reports
>>>>>>>>>>>>> > > > >>>>> call, so the task status RPC call will happen
>>>>>>>>>>>>> after the split
>>>>>>>>>>>>> > > > reports call
>>>>>>>>>>>>> > > > >>>>> on the JM side. Therefore, on the JM side, the
>>>>>>>>>>>>> SourceCoordinator
>>>>>>>>>>>>> > will
>>>>>>>>>>>>> > > > >>>>> always first receive the split reports, then
>>>>>>>>>>>>> receive the
>>>>>>>>>>>>> > checkpoint
>>>>>>>>>>>>> > > > request.
>>>>>>>>>>>>> > > > >>>>> This "happen before" relationship is kind of
>>>>>>>>>>>>> important to
>>>>>>>>>>>>> > guarantee
>>>>>>>>>>>>> > > > >>>>> the consistent state between enumerator and
>>>>>>>>>>>>> readers.
>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>> > > > >>>>> Scenario 1:
>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>> > > > >>>>>> 1. Upon restart, Reader A reports assigned splits
>>>>>>>>>>>>> (1 and 2), and
>>>>>>>>>>>>> > > > >>>>>> Reader B reports (3 and 4).
>>>>>>>>>>>>> > > > >>>>>> 2. The enumerator receives these reports but only
>>>>>>>>>>>>> reassigns
>>>>>>>>>>>>> > splits 1
>>>>>>>>>>>>> > > > >>>>>> and 2 — not 3 and 4.
>>>>>>>>>>>>> > > > >>>>>> 3. A checkpoint or savepoint is then triggered.
>>>>>>>>>>>>> Only splits 1
>>>>>>>>>>>>> > and 2
>>>>>>>>>>>>> > > > >>>>>> are recorded in the reader states; splits 3 and 4
>>>>>>>>>>>>> are not
>>>>>>>>>>>>> > persisted.
>>>>>>>>>>>>> > > > >>>>>> 4. If the job is later restarted from this
>>>>>>>>>>>>> checkpoint, splits 3
>>>>>>>>>>>>> > and
>>>>>>>>>>>>> > > > 4
>>>>>>>>>>>>> > > > >>>>>> will be permanently lost.
>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>> > > > >>>>> This scenario is possible. One solution is to let
>>>>>>>>>>>>> the enumerator
>>>>>>>>>>>>> > > > >>>>> implementation handle this. That means if the
>>>>>>>>>>>>> enumerator relies
>>>>>>>>>>>>> > on
>>>>>>>>>>>>> > > > the
>>>>>>>>>>>>> > > > >>>>> initial split reports from the source readers, it
>>>>>>>>>>>>> should maintain
>>>>>>>>>>>>> > > > these
>>>>>>>>>>>>> > > > >>>>> reports by itself. In the above example, the
>>>>>>>>>>>>> enumerator will need
>>>>>>>>>>>>> > to
>>>>>>>>>>>>> > > > >>>>> remember that 3 and 4 are not assigned and put them
>>>>>>>>>>>>> into its own
>>>>>>>>>>>>> > state.
>>>>>>>>>>>>> > > > >>>>> The current contract is that anything assigned to
>>>>>>>>>>>>> the
>>>>>>>>>>>>> > SourceReaders
>>>>>>>>>>>>> > > > >>>>> is completely owned by the SourceReaders.
>>>>>>>>>>>>> Enumerators can
>>>>>>>>>>>>> > remember
>>>>>>>>>>>>> > > > the
>>>>>>>>>>>>> > > > >>>>> assignments but cannot change them, even when the
>>>>>>>>>>>>> source reader
>>>>>>>>>>>>> > > > recovers /
>>>>>>>>>>>>> > > > >>>>> restarts.
>>>>>>>>>>>>> > > > >>>>> With this FLIP, the contract becomes that the
>>>>>>>>>>>>> source readers will
>>>>>>>>>>>>> > > > >>>>> return the ownership of the splits to the
>>>>>>>>>>>>> enumerator. So the
>>>>>>>>>>>>> > > > enumerator is
>>>>>>>>>>>>> > > > >>>>> responsible for maintaining these splits, until
>>>>>>>>>>>>> they are assigned
>>>>>>>>>>>>> > to
>>>>>>>>>>>>> > > > a
>>>>>>>>>>>>> > > > >>>>> source reader again.
>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>> > > > >>>>> There are other cases where there may be conflict
>>>>>>>>>>>>> information
>>>>>>>>>>>>> > between
>>>>>>>>>>>>> > > > >>>>> reader and enumerator. For example, consider the
>>>>>>>>>>>>> following
>>>>>>>>>>>>> > sequence:
>>>>>>>>>>>>> > > > >>>>> 1. reader A reports splits (1 and 2) up on restart.
>>>>>>>>>>>>> > > > >>>>> 2. enumerator receives the report and assigns both
>>>>>>>>>>>>> 1 and 2 to
>>>>>>>>>>>>> > reader
>>>>>>>>>>>>> > > > B.
>>>>>>>>>>>>> > > > >>>>> 3. reader A failed before checkpointing. And this
>>>>>>>>>>>>> is a partial
>>>>>>>>>>>>> > > > >>>>> failure, so only reader A restarts.
>>>>>>>>>>>>> > > > >>>>> 4. When reader A recovers, it will again report
>>>>>>>>>>>>> splits (1 and 2)
>>>>>>>>>>>>> > to
>>>>>>>>>>>>> > > > >>>>> the enumerator.
>>>>>>>>>>>>> > > > >>>>> 5. The enumerator should ignore this report
>>>>>>>>>>>>> because it has
>>>>>>>>>>>>> > > > >>>>> assigned splits (1 and 2) to reader B.
>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>> > > > >>>>> So with the new contract, the enumerator should be
>>>>>>>>>>>>> the source of
>>>>>>>>>>>>> > > > truth
>>>>>>>>>>>>> > > > >>>>> for split ownership.
>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>> > > > >>>>> Thanks,
>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>> > > > >>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>> > > > >>>>> On Fri, Aug 8, 2025 at 12:58 AM Hongshun Wang <
>>>>>>>>>>>>> > > > [email protected]>
>>>>>>>>>>>>> > > > >>>>> wrote:
>>>>>>>>>>>>> > > > >>>>>
>>>>>>>>>>>>> > > > >>>>>> Hi Becket,
>>>>>>>>>>>>> > > > >>>>>>
>>>>>>>>>>>>> > > > >>>>>>
>>>>>>>>>>>>> > > > >>>>>> I did consider this approach at the beginning
>>>>>>>>>>>>> (and it was also
>>>>>>>>>>>>> > > > >>>>>> mentioned in this FLIP), since it would allow
>>>>>>>>>>>>> more flexibility
>>>>>>>>>>>>> > in
>>>>>>>>>>>>> > > > >>>>>> reassigning all splits. However, there are a few
>>>>>>>>>>>>> potential
>>>>>>>>>>>>> > issues.
>>>>>>>>>>>>> > > > >>>>>>
>>>>>>>>>>>>> > > > >>>>>> 1. High Transmission Cost
>>>>>>>>>>>>> > > > >>>>>> If we pass the full split objects (rather than
>>>>>>>>>>>>> just split IDs),
>>>>>>>>>>>>> > the
>>>>>>>>>>>>> > > > >>>>>> data size could be significant, leading to high
>>>>>>>>>>>>> overhead during
>>>>>>>>>>>>> > > > >>>>>> transmission — especially when many splits are
>>>>>>>>>>>>> involved.
>>>>>>>>>>>>> > > > >>>>>>
>>>>>>>>>>>>> > > > >>>>>> 2. Risk of Split Loss
>>>>>>>>>>>>> > > > >>>>>>
>>>>>>>>>>>>> > > > >>>>>> Risk of split loss exists unless we have a
>>>>>>>>>>>>> mechanism to make
>>>>>>>>>>>>> > sure
>>>>>>>>>>>>> > > > >>>>>> a checkpoint can only be taken after all the splits are
>>>>>>>>>>>>> reassigned.
>>>>>>>>>>>>> > > > >>>>>> There are scenarios where splits could be lost
>>>>>>>>>>>>> due to
>>>>>>>>>>>>> > inconsistent
>>>>>>>>>>>>> > > > >>>>>> state handling during recovery:
>>>>>>>>>>>>> > > > >>>>>>
>>>>>>>>>>>>> > > > >>>>>>
>>>>>>>>>>>>> > > > >>>>>> Scenario 1:
>>>>>>>>>>>>> > > > >>>>>>
>>>>>>>>>>>>> > > > >>>>>>
>>>>>>>>>>>>> > > > >>>>>>    1. Upon restart, Reader A reports assigned
>>>>>>>>>>>>> splits (1 and 2),
>>>>>>>>>>>>> > and
>>>>>>>>>>>>> > > > >>>>>>    Reader B reports (3 and 4).
>>>>>>>>>>>>> > > > >>>>>>    2. The enumerator receives these reports but
>>>>>>>>>>>>> only reassigns
>>>>>>>>>>>>> > > > >>>>>>    splits 1 and 2 — not 3 and 4.
>>>>>>>>>>>>> > > > >>>>>>    3. A checkpoint or savepoint is then
>>>>>>>>>>>>> triggered. Only splits 1
>>>>>>>>>>>>> > and
>>>>>>>>>>>>> > > > >>>>>>    2 are recorded in the reader states; splits 3
>>>>>>>>>>>>> and 4 are not
>>>>>>>>>>>>> > > > persisted.
>>>>>>>>>>>>> > > > >>>>>>    4. If the job is later restarted from this
>>>>>>>>>>>>> checkpoint, splits
>>>>>>>>>>>>> > 3
>>>>>>>>>>>>> > > > >>>>>>    and 4 will be permanently lost.
>>>>>>>>>>>>> > > > >>>>>>
>>>>>>>>>>>>> > > > >>>>>>
>>>>>>>>>>>>> > > > >>>>>> Scenario 2:
>>>>>>>>>>>>> > > > >>>>>>
>>>>>>>>>>>>> > > > >>>>>>    1. Reader A reports splits (1 and 2), and
>>>>>>>>>>>>> Reader B reports (3
>>>>>>>>>>>>> > and
>>>>>>>>>>>>> > > > >>>>>>    4) upon restart.
>>>>>>>>>>>>> > > > >>>>>>    2. Before the enumerator receives all reports
>>>>>>>>>>>>> and performs
>>>>>>>>>>>>> > > > >>>>>>    reassignment, a checkpoint is triggered.
>>>>>>>>>>>>> > > > >>>>>>    3. Since no splits have been reassigned yet,
>>>>>>>>>>>>> both readers
>>>>>>>>>>>>> > have
>>>>>>>>>>>>> > > > >>>>>>    empty states.
>>>>>>>>>>>>> > > > >>>>>>    4. When restarting from this checkpoint, all
>>>>>>>>>>>>> four splits are
>>>>>>>>>>>>> > > > lost.
>>>>>>>>>>>>> > > > >>>>>>
>>>>>>>>>>>>> > > > >>>>>>
>>>>>>>>>>>>> > > > >>>>>> Let me know if you have thoughts on how we might
>>>>>>>>>>>>> mitigate these
>>>>>>>>>>>>> > > > risks!
>>>>>>>>>>>>> > > > >>>>>>
>>>>>>>>>>>>> > > > >>>>>> Best
>>>>>>>>>>>>> > > > >>>>>> Hongshun
>>>>>>>>>>>>> > > > >>>>>>
>>>>>>>>>>>>> > > > >>>>>> On Fri, Aug 8, 2025 at 1:46 AM Becket Qin <
>>>>>>>>>>>>> [email protected]>
>>>>>>>>>>>>> > > > >>>>>> wrote:
>>>>>>>>>>>>> > > > >>>>>>
>>>>>>>>>>>>> > > > >>>>>>> Hi Hongshun,
>>>>>>>>>>>>> > > > >>>>>>>
>>>>>>>>>>>>> > > > >>>>>>> The steps sound reasonable to me in general. In
>>>>>>>>>>>>> terms of the
>>>>>>>>>>>>> > > > updated
>>>>>>>>>>>>> > > > >>>>>>> FLIP wiki, it would be good to see if we can
>>>>>>>>>>>>> keep the protocol
>>>>>>>>>>>>> > > > simple. One
>>>>>>>>>>>>> > > > >>>>>>> alternative way to achieve this behavior is
>>>>>>>>>>>>> following:
>>>>>>>>>>>>> > > > >>>>>>>
>>>>>>>>>>>>> > > > >>>>>>> 1. Upon SourceOperator startup, the
>>>>>>>>>>>>> SourceOperator sends
>>>>>>>>>>>>> > > > >>>>>>> ReaderRegistrationEvent with the currently
>>>>>>>>>>>>> assigned splits to
>>>>>>>>>>>>> > the
>>>>>>>>>>>>> > > > >>>>>>> enumerator. It does not add these splits to the
>>>>>>>>>>>>> SourceReader.
>>>>>>>>>>>>> > > > >>>>>>> 2. The enumerator will always use the
>>>>>>>>>>>>> > > > >>>>>>> SourceEnumeratorContext.assignSplits() to assign
>>>>>>>>>>>>> the splits.
>>>>>>>>>>>>> > (not
>>>>>>>>>>>>> > > > via the
>>>>>>>>>>>>> > > > >>>>>>> response of the SourceRegistrationEvent, this
>>>>>>>>>>>>> allows async
>>>>>>>>>>>>> > split
>>>>>>>>>>>>> > > > assignment
>>>>>>>>>>>>> > > > >>>>>>> in case the enumerator wants to wait until all
>>>>>>>>>>>>> the readers are
>>>>>>>>>>>>> > > > registered)
>>>>>>>>>>>>> > > > >>>>>>> 3. The SourceOperator will only call
>>>>>>>>>>>>> SourceReader.addSplits()
>>>>>>>>>>>>> > when
>>>>>>>>>>>>> > > > >>>>>>> it receives the AddSplitEvent from the
>>>>>>>>>>>>> enumerator.
>>>>>>>>>>>>> > > > >>>>>>>
>>>>>>>>>>>>> > > > >>>>>>> This protocol has a few benefits:
>>>>>>>>>>>>> > > > >>>>>>> 1. it basically allows arbitrary split
>>>>>>>>>>>>> reassignment upon
>>>>>>>>>>>>> > restart
>>>>>>>>>>>>> > > > >>>>>>> 2. simplicity: there is only one way to assign
>>>>>>>>>>>>> splits.
>>>>>>>>>>>>> > > > >>>>>>>
>>>>>>>>>>>>> > > > >>>>>>> So we only need one interface change:
>>>>>>>>>>>>> > > > >>>>>>> - add the initially assigned splits to
>>>>>>>>>>>>> ReaderInfo so the
>>>>>>>>>>>>> > Enumerator
>>>>>>>>>>>>> > > > >>>>>>> can access it.
>>>>>>>>>>>>> > > > >>>>>>> and one behavior change:
>>>>>>>>>>>>> > > > >>>>>>> - The SourceOperator should stop assigning
>>>>>>>>>>>>> splits to the SourceReader from
>>>>>>>>>>>>> > state
>>>>>>>>>>>>> > > > >>>>>>> restoration, but
>>>>>>>>>>>>> > [message truncated...]
>>>>>>>>>>>>> >
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
