andygrove opened a new issue, #2452:
URL: https://github.com/apache/datafusion-comet/issues/2452

   ### Describe the bug
   
   In some configurations/environments, I see queries fail due to memory pool 
requests being rejected, but I would expect Comet to spill to disk instead.
   
   In one example, I am running TPC-H @ SF=1000 (1TB) in k8s. I am specifying 
`spark.comet.exec.replaceSortMergeJoin=false` to force the use of 
`CometSortMergeJoinExec`.
   
   ```
       --conf spark.executor.instances=4 \
       --conf spark.executor.cores=8 \
       --conf spark.executor.memory=8G \
       --conf spark.memory.offHeap.enabled=true \
       --conf spark.memory.offHeap.size=4g \
   ```
   
   I allocated 4g of off-heap memory, which equates to 512 MB per core.
   
   I saw memory requests fail with the memory pool limit at ~512MB.
   
   I then doubled the off-heap memory but still see the same issue, with the 
pool limit now at ~1GB. I would expect spilling to kick in instead.
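
As a quick check of the arithmetic above, here is a minimal sketch that computes the per-core share of off-heap memory for both settings. It assumes the off-heap size is split evenly across executor cores, which is what the 512 MB figure implies; the helper name `per_core_pool_mib` is hypothetical and not part of Spark or Comet.

```python
# Sketch: per-core share of Spark off-heap memory, assuming an even split
# across executor cores (an assumption taken from the arithmetic in this issue).

def per_core_pool_mib(off_heap_gib: int, executor_cores: int) -> float:
    """Return the off-heap memory available per core, in MiB."""
    return off_heap_gib * 1024 / executor_cores


EXECUTOR_CORES = 8  # spark.executor.cores from the configuration above

for off_heap_gib in (4, 8):  # original setting, then the doubled setting
    share = per_core_pool_mib(off_heap_gib, EXECUTOR_CORES)
    print(f"{off_heap_gib} GiB off-heap / {EXECUTOR_CORES} cores = {share:.0f} MiB per core")
```

Under that assumption the two results line up with the observed pool limits of ~512MB and ~1GB.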
   
   ```
   org.apache.comet.CometNativeException: Additional allocation failed with top memory consumers (across reservations) as:
     ExternalSorter[107]#2991(can spill: true) consumed 1024.2 MB,
     ExternalSorterMerge[107]#2990(can spill: false) consumed 16.7 MB,
     GroupedHashAggregateStream[107] ()#2994(can spill: true) consumed 0.0 B,
     GroupedHashAggregateStream[107] ()#2995(can spill: true) consumed 0.0 B,
     ExternalSorterMerge[107]#2992(can spill: false) consumed 0.0 B,
   ```
   
   
   I also see pods being killed due to OOM:
   
   ```
   NAME                                                        READY   STATUS              RESTARTS   AGE
   comet-benchmark-derived-from-tpch-a1133f997d91851b-exec-1   0/1     OOMKilled           0          11m
   comet-benchmark-derived-from-tpch-a1133f997d91851b-exec-3   0/1     OOMKilled           0          11m
   comet-benchmark-derived-from-tpch-a1133f997d91851b-exec-4   0/1     OOMKilled           0          11m
   comet-benchmark-derived-from-tpch-a1133f997d91851b-exec-5   1/1     Running             0          4s
   comet-benchmark-derived-from-tpch-a1133f997d91851b-exec-6   1/1     Running             0          4s
   comet-benchmark-derived-from-tpch-a1133f997d91851b-exec-7   1/1     Running             0          3s
   comet-benchmark-derived-from-tpch-a1133f997d91851b-exec-8   0/1     ContainerCreating   0          1s
   ```
   
   I also see errors in the executor logs:
   
   ```
   25/09/24 21:24:57 WARN ExecutionMemoryPool: Internal error: release called on 917504 bytes but task only has 0 bytes of memory from the off-heap execution pool
   25/09/24 21:24:57 WARN ExecutionMemoryPool: Internal error: release called on 839664 bytes but task only has 0 bytes of memory from the off-heap execution pool
   25/09/24 21:24:57 WARN ExecutionMemoryPool: Internal error: release called on 917504 bytes but task only has 0 bytes of memory from the off-heap execution pool
   ```
   
   
   ### Steps to reproduce
   
   _No response_
   
   ### Expected behavior
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

