andygrove opened a new issue, #2452:
URL: https://github.com/apache/datafusion-comet/issues/2452
### Describe the bug
In some configurations/environments, I see queries fail due to memory pool
requests being rejected, but I would expect Comet to spill to disk instead.
In one example, I am running TPC-H @ SF=1000 (1TB) in k8s. I am specifying
`spark.comet.exec.replaceSortMergeJoin=false` to force the use of
`CometSortMergeJoinExec`.
```
--conf spark.executor.instances=4 \
--conf spark.executor.cores=8 \
--conf spark.executor.memory=8G \
--conf spark.memory.offHeap.enabled=true \
--conf spark.memory.offHeap.size=4g \
```
I allocated 4g of off-heap memory, which equates to 512 MB per core.
I saw memory requests fail with the memory pool limit at ~512MB.
I then doubled the off-heap memory but still see the same issue, with the
pool limit now at ~1GB. I would expect spilling to kick in instead.
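For reference, the per-core figures above follow from dividing the off-heap allocation evenly across the executor cores. A minimal sketch of that arithmetic is below; the helper name `per_core_pool_mb` is hypothetical, and the even split is an assumption based on the limits observed in this report, not a statement about how Comet sizes its memory pools internally.

```python
def per_core_pool_mb(off_heap_gb: float, executor_cores: int) -> float:
    """Assumed even split of off-heap memory across executor cores, in MB."""
    return off_heap_gb * 1024 / executor_cores


# Values from the configuration in this report: 8 cores, 4g then 8g off-heap.
for off_heap_gb in (4, 8):
    share = per_core_pool_mb(off_heap_gb, executor_cores=8)
    print(f"{off_heap_gb}g off-heap / 8 cores = {share:.0f} MB per core")
```

Under this assumption, doubling the off-heap size only raises the per-core ceiling; it would not by itself change whether an operator spills before reaching that ceiling.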
```
org.apache.comet.CometNativeException: Additional allocation failed with top memory consumers (across reservations) as:
ExternalSorter[107]#2991(can spill: true) consumed 1024.2 MB,
ExternalSorterMerge[107]#2990(can spill: false) consumed 16.7 MB,
GroupedHashAggregateStream[107] ()#2994(can spill: true) consumed 0.0 B,
GroupedHashAggregateStream[107] ()#2995(can spill: true) consumed 0.0 B,
ExternalSorterMerge[107]#2992(can spill: false) consumed 0.0 B,
```
I also see pods being killed due to OOM:
```
NAME                                                        READY   STATUS              RESTARTS   AGE
comet-benchmark-derived-from-tpch-a1133f997d91851b-exec-1   0/1     OOMKilled           0          11m
comet-benchmark-derived-from-tpch-a1133f997d91851b-exec-3   0/1     OOMKilled           0          11m
comet-benchmark-derived-from-tpch-a1133f997d91851b-exec-4   0/1     OOMKilled           0          11m
comet-benchmark-derived-from-tpch-a1133f997d91851b-exec-5   1/1     Running             0          4s
comet-benchmark-derived-from-tpch-a1133f997d91851b-exec-6   1/1     Running             0          4s
comet-benchmark-derived-from-tpch-a1133f997d91851b-exec-7   1/1     Running             0          3s
comet-benchmark-derived-from-tpch-a1133f997d91851b-exec-8   0/1     ContainerCreating   0          1s
```
I also see errors in the executor logs:
```
25/09/24 21:24:57 WARN ExecutionMemoryPool: Internal error: release called on 917504 bytes but task only has 0 bytes of memory from the off-heap execution pool
25/09/24 21:24:57 WARN ExecutionMemoryPool: Internal error: release called on 839664 bytes but task only has 0 bytes of memory from the off-heap execution pool
25/09/24 21:24:57 WARN ExecutionMemoryPool: Internal error: release called on 917504 bytes but task only has 0 bytes of memory from the off-heap execution pool
```
### Steps to reproduce
_No response_
### Expected behavior
_No response_
### Additional context
_No response_