scwhittle opened a new issue, #32714:
URL: https://github.com/apache/beam/issues/32714

   ### What happened?
   
   Java SDK pipelines using the FnApi can become wedged if the single data 
stream is blocked attempting to queue records for a process bundle handler 
which was never registered.
   
![image](https://github.com/user-attachments/assets/40462db2-3ebd-4080-82a0-a70769b1d7a9)
   
   It seems like this same stuckness could also occur with the error handling 
if we did create the bundle handler successfully but encountered an exception 
during processing and unregistered the instruction id and then had more data 
show up for the instruction id. The healthy case ensures that the data for the 
instruction is drained before returning and removing the registration. 
   
![image](https://github.com/user-attachments/assets/aba78a21-2d1b-424e-85b6-547a18f0f957)
   
   There could be a control stream issue where the instruction id doesn't 
arrive but I think that there is a race with error handling that could explain 
why the future is never notified. It seems possible that a handler for the 
instruction id was recieved and registered but then was removed due to 
processing completing with an error. If an element on the data stream arrives 
for the instruction id after this point it will create a new entry in the map 
for the instruction id which was removed, and will block forever because the 
instruction id will not be registered and have it's future notified.
   
![image](https://github.com/user-attachments/assets/2f74f6d4-f03e-44c3-9865-b740de9a8cad)
   
   To fix I think we could:
   1. catch and retry the exception creating the bundle processor
   2. add a timeout to the data multiplexing waiting for the instruction to 
arrive. If the timeout is exceeded we'd have to error out the data stream 
causing SDK restart but that is preferrable to being stuck forever. We could 
reduce this triggering for the identified races above by having a cache of 
recently errored instructions.
   
   
   ### Issue Priority
   
   Priority: 2 (default / most bugs should be filed as P2)
   
   ### Issue Components
   
   - [ ] Component: Python SDK
   - [X] Component: Java SDK
   - [ ] Component: Go SDK
   - [ ] Component: Typescript SDK
   - [ ] Component: IO connector
   - [ ] Component: Beam YAML
   - [ ] Component: Beam examples
   - [ ] Component: Beam playground
   - [ ] Component: Beam katas
   - [ ] Component: Website
   - [ ] Component: Infrastructure
   - [ ] Component: Spark Runner
   - [ ] Component: Flink Runner
   - [ ] Component: Samza Runner
   - [ ] Component: Twister2 Runner
   - [ ] Component: Hazelcast Jet Runner
   - [ ] Component: Google Cloud Dataflow Runner


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to