Xiao-zhen-Liu opened a new issue, #4556:
URL: https://github.com/apache/texera/issues/4556

   ### What happened?
   
   Region execution coordination can schedule the next region before the 
previous region's workers are fully terminated.
   
   This can lead to unsafe interleavings between region termination, 
next-region startup, and workflow completion. In particular, when `endWorker` 
is sent while a worker still has queued messages, the worker rejects 
termination with errors like:
   
   `Received EndHandler before all messages are processed`
   
   In the synchronous kill path, workflow completion can also be emitted in 
the gap after the previously initialized region has completed but before the 
next pending region has been initialized.
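   
   One way such a gap could produce a premature completion is sketched below. 
This is a hypothetical model, not Texera's actual code: `RegionState`, 
`naive_is_workflow_done`, and `is_workflow_done` are names invented for 
illustration, and the assumption is that the racy check only inspects regions 
that have already been initialized.

```python
from enum import Enum, auto


class RegionState(Enum):
    """Hypothetical lifecycle states for a scheduled region."""
    PENDING = auto()      # scheduled, but not initialized yet
    INITIALIZED = auto()  # workers launched, still running
    COMPLETED = auto()    # all workers finished


def naive_is_workflow_done(states):
    """Racy check (assumed): ignores regions that are still PENDING."""
    initialized = [s for s in states if s is not RegionState.PENDING]
    return all(s is RegionState.COMPLETED for s in initialized)


def is_workflow_done(states):
    """Stricter check: every scheduled region must be COMPLETED."""
    return all(s is RegionState.COMPLETED for s in states)


# The gap: region 0 has completed, region 1 has not been initialized yet.
gap = [RegionState.COMPLETED, RegionState.PENDING]
print(naive_is_workflow_done(gap))  # completion would be emitted too early
print(is_workflow_done(gap))        # completion is correctly withheld
```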
   
   The coordinator should not start the next region until the previous region's 
workers have acknowledged `endWorker` and have been gracefully stopped.
   
   If `endWorker` fails because the worker still has queued messages, 
termination should retry until the worker queue is drained and `endWorker` 
succeeds. Workflow completion should only be emitted after all scheduled 
regions have finished.
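   
   The expected behavior could be modeled roughly as follows. This is a 
minimal single-threaded sketch under stated assumptions, not the engine's 
implementation: `Worker`, `EndWorkerRejected`, `terminate_region`, and 
`run_regions` are hypothetical names, and draining one queued message per 
failed attempt stands in for waiting while the real worker processes its queue.

```python
from collections import deque
from dataclasses import dataclass, field


class EndWorkerRejected(Exception):
    """Hypothetical error: the worker still has queued messages."""


@dataclass
class Worker:
    name: str
    queue: deque = field(default_factory=deque)
    terminated: bool = False

    def process_one(self) -> None:
        # Consume one queued message, if any.
        if self.queue:
            self.queue.popleft()

    def end_worker(self) -> None:
        # Reject termination while messages are still queued.
        if self.queue:
            raise EndWorkerRejected(
                "Received EndHandler before all messages are processed"
            )
        self.terminated = True


def terminate_region(workers, max_attempts=100):
    """Retry end_worker until every worker acknowledges it.

    Returns the number of retry rounds that were needed.
    """
    attempts = 0
    pending = list(workers)
    while pending:
        attempts += 1
        if attempts > max_attempts:
            raise RuntimeError("workers did not drain in time")
        still_pending = []
        for worker in pending:
            try:
                worker.end_worker()
            except EndWorkerRejected:
                # Stand-in for backing off while the worker drains its queue.
                worker.process_one()
                still_pending.append(worker)
        pending = still_pending
    return attempts


def run_regions(regions):
    """Start region i+1 only after region i's workers have all terminated."""
    events = []
    for index, workers in enumerate(regions):
        events.append(f"start region {index}")
        terminate_region(workers)
        events.append(f"terminated region {index}")
    # Completion is emitted only after all scheduled regions have finished.
    events.append("workflow completed")
    return events
```

   In a real coordinator the retry would presumably be asynchronous with a 
backoff, but the ordering constraint is the same: termination is acknowledged 
before the next region starts, and completion comes last.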
   
   
   ### How to reproduce?
   
   Run a workflow with multiple scheduled regions in which the first region 
completes while its workers may still have queued control messages, such as 
statistics query replies.
   
   Steps:
   1. Start a workflow that is split into at least two regions.
   2. Let the first region complete.
   3. Observe that the coordinator can attempt to start the second region 
before the first region's worker termination is fully complete.
   4. In some runs, observe worker termination failures like:
      `Received EndHandler before all messages are processed`
   5. In some runs, observe premature workflow completion before the next 
region has fully launched or finished.
   
   
   ### Version
   
   1.1.0-incubating (Pre-release/Master)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
