lifulong opened a new issue, #10436:
URL: https://github.com/apache/incubator-gluten/issues/10436
### Backend
VL (Velox)
### Bug description
"E20250814 14:12:01.533819 3578786 Exceptions.h:70] Line:
/home/lifulong/incubator-gluten/ep/build-velox/build/velox_ep/velox/exec/Task.cpp:2105,
Function:terminate, Expression: Cancelled, Source: RUNTIME, ErrorCode:
INVALID_STATE
I20250814 14:12:01.533937 3578786 Task.cpp:2117] Terminating task
Gluten_Stage_12_TID_123112_VTID_3 with state Canceled after running for 3m 48s"
The two lines above are the only error messages in the Spark executor log; I
could not find any further error information. I tried adding logging around the
memory arbitrator and the wait-timeout paths to locate the root cause, but that
turned up nothing. Does anyone have ideas for troubleshooting this issue
further?
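One possible next step is to pull every message logged by the same thread as the terminate call and see what ran just before the cancel. The sketch below is only an illustration: it assumes the executor log uses the standard glog prefix (severity letter, date, time, thread id, `file:line]`), and the helper names `parse_glog` and `lines_for_thread` are hypothetical. The sample input is the two log lines quoted above, joined onto single lines.

```python
import re

# glog prefix: <severity><yyyymmdd> <hh:mm:ss.uuuuuu> <thread id> <file:line>] <message>
GLOG_LINE = re.compile(
    r"^(?P<sev>[IWEF])(?P<date>\d{8}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<tid>\d+) (?P<src>[^\]]+)\] (?P<msg>.*)$"
)


def parse_glog(line):
    """Return the fields of one glog line as a dict, or None if it does not match."""
    m = GLOG_LINE.match(line.strip())
    return m.groupdict() if m else None


def lines_for_thread(lines, tid):
    """Keep only the parsed records that were logged by thread `tid`."""
    records = []
    for line in lines:
        rec = parse_glog(line)
        if rec is not None and rec["tid"] == tid:
            records.append(rec)
    return records


# The two lines from the executor log in this issue.
sample = [
    ("E20250814 14:12:01.533819 3578786 Exceptions.h:70] Line: "
     "/home/lifulong/incubator-gluten/ep/build-velox/build/velox_ep/velox/exec/Task.cpp:2105, "
     "Function:terminate, Expression: Cancelled, Source: RUNTIME, ErrorCode: INVALID_STATE"),
    ("I20250814 14:12:01.533937 3578786 Task.cpp:2117] Terminating task "
     "Gluten_Stage_12_TID_123112_VTID_3 with state Canceled after running for 3m 48s"),
]

for rec in lines_for_thread(sample, "3578786"):
    print(rec["time"], rec["sev"], rec["src"], rec["msg"][:60])
```

Running the same filter over the full executor log, rather than this two-line sample, would show whether that thread logged anything before the cancel.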
More information about the test runs:
- The Spark SQL job always fails with 1.5G of off-heap memory per core.
- The Spark SQL job fails with a certain probability with 3G of off-heap memory
  per core. After increasing the spark.task.maxFailures and
  spark.yarn.max.executor.failures configs, the SQL job always succeeds.
- The SQL runs with spark.gluten.sql.columnar.forceShuffledHashJoin=false, so
  that sort merge join is used, and it spills 500M of data to disk per task.
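
For anyone trying to reproduce this, the setup above can be sketched as a set of `--conf` arguments. This is only an illustration under stated assumptions: "off-heap per core" is read as `spark.memory.offHeap.size` divided by `spark.executor.cores`, and the executor core count (4) and the two failure limits (8 and 16) are placeholder values, not the ones used in the failing job. The helper names are hypothetical.

```python
def build_spark_conf(offheap_gb_per_core, executor_cores,
                     max_task_failures, max_executor_failures):
    """Build the Spark settings described in this issue as a plain dict."""
    # Assumption: total off-heap = per-core off-heap * cores per executor.
    offheap_mb = int(offheap_gb_per_core * executor_cores * 1024)
    return {
        "spark.memory.offHeap.enabled": "true",
        "spark.memory.offHeap.size": f"{offheap_mb}m",
        "spark.executor.cores": str(executor_cores),
        # Disabled so that sort merge join is used, as in the report.
        "spark.gluten.sql.columnar.forceShuffledHashJoin": "false",
        # Placeholder limits; the report only says these were increased.
        "spark.task.maxFailures": str(max_task_failures),
        "spark.yarn.max.executor.failures": str(max_executor_failures),
    }


def to_submit_args(conf):
    """Render a conf dict as a flat list of spark-submit arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


# 1.5G per core is the case that always fails; 3G per core fails intermittently.
conf = build_spark_conf(1.5, 4, 8, 16)
print(" ".join(to_submit_args(conf)))
```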
### Gluten version
Gluten-1.4
### Spark version
Spark-3.5.x
### Spark configurations
_No response_
### System information
### Relevant logs
```bash
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]