mixermt opened a new issue, #1714:
URL: https://github.com/apache/datafusion-comet/issues/1714

   Hi,
   
   Experience occasional failure of Spark executors 
   
   ```
   # A fatal error has been detected by the Java Runtime Environment:
   #
   #  SIGSEGV (0xb) at pc=0x00007f079663f84e, pid=18, tid=0x00007f07347ff700
   #
   # JRE version: OpenJDK Runtime Environment (Zulu 8.74.0.17-CA-linux64) (8.0_392-b08) (build 1.8.0_392-b08)
   # Java VM: OpenJDK 64-Bit Server VM (25.392-b08 mixed mode linux-amd64 compressed oops)
   # Problematic frame:
   # V  [libjvm.so+0x8a584e]  MallocSiteTable::malloc_site(unsigned long, unsigned long)+0xe
   #
   # Core dump written. Default location: /opt/spark/work-dir/core or core.18
   #
   # An error report file with more information is saved as:
   # /opt/spark/work-dir/hs_err_pid18.log
   [thread 139669100558080 also had an error]
   [thread 139669096355584 also had an error]
   #
   # If you would like to submit a bug report, please visit:
   #   http://www.azul.com/support/
   #
   ```
   
   From Spark UI 
   ```
   ExecutorLostFailure (executor 7 exited caused by one of the running tasks) Reason:
   The executor with id 7 exited with exit code -1(unexpected).
   
   The API gave the following container statuses:
         container name: spark-executor
         container image: OUR_SPARK_DOCKER_IMAGE 
         container state: terminated
         container started at: 2025-05-04T12:16:23Z
         container finished at: 2025-05-04T12:17:14Z
         exit code: 134
         termination reason: Error
   ```
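
   As a side note (my reading, not stated in the report): a container exit code above 128 encodes the terminating signal as `code - 128`, so 134 corresponds to signal 6 (SIGABRT), which is consistent with the JVM aborting after writing the fatal error report above. A minimal Python sketch to decode it:
   
   ```python
   import signal
   
   # Exit codes above 128 encode the terminating signal as (code - 128).
   exit_code = 134
   sig = signal.Signals(exit_code - 128)
   print(sig.name)  # SIGABRT: the JVM calls abort() after writing hs_err_pid*.log
   ```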
   
   At first I thought it sounded like an OOM, but when I checked the memory graphs of the pods, none of them reached even half of the requested memory.
   After a number of retries the job succeeded, once execution switched to another executor.
   The input bytes and shuffle sizes are really small compared to the allocated executor memory and off-heap memory (50g and 30g).
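
   For context, the memory settings described above would correspond to roughly the following executor configuration. This is a sketch: the exact property values used in the job are not shown in the report, and the Comet plugin/shuffle settings are assumptions based on a typical Comet setup.
   
   ```properties
   # Assumed executor configuration matching the 50g / 30g figures above
   spark.executor.memory            50g
   spark.memory.offHeap.enabled     true
   spark.memory.offHeap.size        30g
   spark.plugins                    org.apache.spark.CometPlugin
   spark.shuffle.manager            org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager
   ```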
   
   <img width="1506" alt="Image" src="https://github.com/user-attachments/assets/4ac2f2d2-3a13-4e2a-b2a6-e6dd3a0cffc7" />
    
   
   Our env:
   Spark 3.5.4 - Comet version 0.8.0
   
   
   Any ideas?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

