andresti opened a new issue, #37930:
URL: https://github.com/apache/beam/issues/37930

   ### What happened?
   
   **ProcessManager.stopProcess()** calls **destroy()/destroyForcibly()** to 
terminate child processes but never calls **Process.waitFor()** to collect the 
exit status. On POSIX systems, this leaves the terminated child as a zombie 
(state Z/defunct) in the kernel process table until the parent process exits.
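
A minimal sketch of the reaping pattern this report asks for, assuming a plain **java.lang.Process** handle. The names **ReapingStop**, **stopAndReap** and **graceMillis** are hypothetical and chosen for illustration; this is not Beam's actual ProcessManager code, and whether this change alone clears the zombies described here has not been verified.

```java
import java.util.concurrent.TimeUnit;

final class ReapingStop {
  /**
   * Hypothetical helper (not Beam code): asks the child to terminate,
   * escalates to a forcible kill after a grace period, and then blocks in
   * waitFor() so the exit status is always collected.
   */
  static int stopAndReap(Process process, long graceMillis) throws InterruptedException {
    process.destroy(); // polite termination request first
    if (!process.waitFor(graceMillis, TimeUnit.MILLISECONDS)) {
      process.destroyForcibly(); // child ignored the request, kill it
    }
    // Returns immediately if the child has already exited.
    return process.waitFor();
  }

  public static void main(String[] args) throws Exception {
    // Assumes a POSIX "sleep" binary is available on the PATH.
    Process child = new ProcessBuilder("sleep", "60").start();
    int exitCode = stopAndReap(child, 2000);
    System.out.println("alive=" + child.isAlive() + " exitCode=" + exitCode);
  }
}
```

The bounded wait before **destroyForcibly()** keeps a stuck child from blocking the caller indefinitely, while the final **waitFor()** guarantees that the exit status is collected on every path.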
                                                                                
                                                                                
                                                                              
   In long-running environments such as Flink TaskManagers using 
**--environment_type=PROCESS**, expansion service processes 
(**/opt/apache/beam/java_boot**) are repeatedly spawned and stopped but never 
reaped. Over time this leads to significant zombie accumulation: we observed 
176+ zombie Java processes on production Flink TaskManager pods.
                                                                                
                                                                                
                                                                              
   Container-level init systems (e.g. dumb-init, tini) cannot help because the 
zombies are children of the still-running Java TaskManager process, and only 
the parent can reap its own children.
      
   ## **How to reproduce**                                                      
                                                                                
                                                                                
   
                     
     1. Run a Flink pipeline with **--environment_type=PROCESS**.
     2. Let it process work for an extended period (hours/days).
     3. Check for zombie processes: **ps aux | grep defunct**
                                                                                
                                                                              
                                                                                
                                                                                
                                                                              
   Zombies will accumulate over time as expansion service processes are started 
and stopped without being reaped.                                               
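
To narrow the check in step 3 to children of one specific JVM, the sketch below reads **/proc/&lt;pid&gt;/stat** directly. It assumes a Linux **/proc** filesystem and Java 9+ (for **ProcessHandle**). **ZombieChildCounter** and its methods are hypothetical names used only for illustration; they are not part of Beam or Flink.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class ZombieChildCounter {
  /**
   * Returns true if a /proc/<pid>/stat line describes a process in state Z
   * whose parent is parentPid. The command name is wrapped in parentheses
   * and may itself contain spaces or ')', so parsing starts after the last ')'.
   */
  static boolean isZombieChildOf(String statLine, long parentPid) {
    int close = statLine.lastIndexOf(')');
    if (close < 0) {
      return false;
    }
    // Fields after the command name: state, ppid, ...
    String[] fields = statLine.substring(close + 1).trim().split("\\s+");
    if (fields.length < 2) {
      return false;
    }
    try {
      return fields[0].equals("Z") && Long.parseLong(fields[1]) == parentPid;
    } catch (NumberFormatException e) {
      return false;
    }
  }

  /** Counts zombie children of parentPid by scanning the numeric entries in /proc. */
  static int countZombieChildren(long parentPid) throws IOException {
    int count = 0;
    try (DirectoryStream<Path> dirs = Files.newDirectoryStream(Paths.get("/proc"), "[0-9]*")) {
      for (Path dir : dirs) {
        try {
          byte[] raw = Files.readAllBytes(dir.resolve("stat"));
          if (isZombieChildOf(new String(raw, StandardCharsets.UTF_8), parentPid)) {
            count++;
          }
        } catch (IOException e) {
          // The process disappeared between listing and reading; skip it.
        }
      }
    }
    return count;
  }

  public static void main(String[] args) throws IOException {
    long self = ProcessHandle.current().pid(); // Java 9+
    System.out.println("zombie children of " + self + ": " + countZombieChildren(self));
  }
}
```

Run inside the TaskManager JVM (or given the TaskManager's PID), a count that keeps growing between checks would be consistent with the accumulation described above.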
                                                                            
   
   ### Issue Priority
   
   Priority: 2 (default / most bugs should be filed as P2)
   
   ### Issue Components
   
   - [ ] Component: Python SDK
   - [ ] Component: Java SDK
   - [ ] Component: Go SDK
   - [ ] Component: Typescript SDK
   - [ ] Component: IO connector
   - [ ] Component: Beam YAML
   - [ ] Component: Beam examples
   - [ ] Component: Beam playground
   - [ ] Component: Beam katas
   - [ ] Component: Website
   - [ ] Component: Infrastructure
   - [ ] Component: Spark Runner
   - [x] Component: Flink Runner
   - [ ] Component: Samza Runner
   - [ ] Component: Twister2 Runner
   - [ ] Component: Hazelcast Jet Runner
   - [ ] Component: Google Cloud Dataflow Runner


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
