taiyang-li opened a new issue, #5823: URL: https://github.com/apache/incubator-gluten/issues/5823
### Backend CH (ClickHouse) ### Bug description When spark speculation execution is enabled, some tasks maybe killed by driver because another attempt had finished successfully. Those tasks failed because they are not exiting gracefully.  ### Spark version Spark-3.3.x ### Spark configurations ``` --conf spark.speculation=true --conf spark.speculation.multiplier=1.1 --conf spark.speculation.quantile=0.5 ``` ### System information _No response_ ### Relevant logs ```bash Task 1 logs 2024/05/21 11:10:57.941 ERROR [Executor task launch worker for task 5.1 in stage 23.0 (TID 2987)] spark.util.TaskResources: Task 2987 failed by error: org.apache.gluten.exception.GlutenException: org.apache.gluten.exception.GlutenException: org.apache.spark.TaskKilledException: While executing SourceFromJavaIter 0. ./contrib/llvm-project/libcxx/include/exception:141: Poco::Exception::Exception(String const&, int) @ 0x0000000012241c39 1. ./build_gcc/./src/Common/Exception.cpp:101: DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x000000000aebb7bb 2. ./contrib/llvm-project/libcxx/include/string:1499: DB::Exception::createRuntime(int, String&) @ 0x00000000060a1b0c 3. ./utils/extern-local-engine/jni/jni_common.h:79: unsigned char local_engine::safeCallBooleanMethod<>(JNIEnv_*, _jobject*, _jmethodID*) @ 0x00000000060a2cab 4. ./build_gcc/./utils/extern-local-engine/Storages/SourceFromJavaIter.cpp:88: local_engine::SourceFromJavaIter::generate() @ 0x000000000b21c922 5. ./src/Processors/Chunk.h:90: DB::ISource::tryGenerate() @ 0x000000000f8cc5f5 6. ./contrib/llvm-project/libcxx/include/optional:344: DB::ISource::work() @ 0x000000000f8cc242 7. ./build_gcc/./src/Processors/Executors/ExecutionThreadContext.cpp:0: DB::ExecutionThreadContext::executeTask() @ 0x000000000f8e3302 8. ./build_gcc/./src/Processors/Executors/PipelineExecutor.cpp:272: DB::PipelineExecutor::executeStepImpl(unsigned long, std::atomic<bool>*) @ 0x000000000f8da0b0 9. ./contrib/llvm-project/libcxx/include/atomic:958: DB::PipelineExecutor::executeStep(std::atomic<bool>*) @ 0x000000000f8d9c49 10. ./build_gcc/./src/Processors/Executors/PullingPipelineExecutor.cpp:54: DB::PullingPipelineExecutor::pull(DB::Chunk&) @ 0x000000000f8e8068 11. ./build_gcc/./src/Processors/Executors/PullingPipelineExecutor.cpp:65: DB::PullingPipelineExecutor::pull(DB::Block&) @ 0x000000000f8e8250 12. ./build_gcc/./utils/extern-local-engine/Parser/SerializedPlanParser.cpp:0: local_engine::LocalExecutor::hasNext() @ 0x000000000b1ecc16 13. ./build_gcc/./utils/extern-local-engine/local_engine_jni.cpp:342: Java_org_apache_gluten_vectorized_BatchIterator_nativeHasNext @ 0x000000000608aa37 at org.apache.gluten.vectorized.GeneralOutIterator.hasNext(GeneralOutIterator.java:39) at org.apache.gluten.backendsapi.clickhouse.CHIteratorApi$$anon$2.hasNext(CHIteratorApi.scala:289) at org.apache.gluten.vectorized.CloseableCHColumnBatchIterator.$anonfun$hasNext$1(CloseableCHColumnBatchIterator.scala:42) at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23) at org.apache.gluten.metrics.GlutenTimeMetric$.withNanoTime(GlutenTimeMetric.scala:41) at org.apache.gluten.vectorized.CloseableCHColumnBatchIterator.hasNext(CloseableCHColumnBatchIterator.scala:42) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:412) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$22(FileFormatWriter.scala:351) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:136) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) Caused by: org.apache.gluten.exception.GlutenException: org.apache.spark.TaskKilledException: While executing SourceFromJavaIter 0. ./contrib/llvm-project/libcxx/include/exception:141: Poco::Exception::Exception(String const&, int) @ 0x0000000012241c39 1. ./build_gcc/./src/Common/Exception.cpp:101: DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x000000000aebb7bb 2. ./contrib/llvm-project/libcxx/include/string:1499: DB::Exception::createRuntime(int, String&) @ 0x00000000060a1b0c 3. ./utils/extern-local-engine/jni/jni_common.h:79: unsigned char local_engine::safeCallBooleanMethod<>(JNIEnv_*, _jobject*, _jmethodID*) @ 0x00000000060a2cab 4. ./build_gcc/./utils/extern-local-engine/Storages/SourceFromJavaIter.cpp:88: local_engine::SourceFromJavaIter::generate() @ 0x000000000b21c922 5. ./src/Processors/Chunk.h:90: DB::ISource::tryGenerate() @ 0x000000000f8cc5f5 6. ./contrib/llvm-project/libcxx/include/optional:344: DB::ISource::work() @ 0x000000000f8cc242 7. ./build_gcc/./src/Processors/Executors/ExecutionThreadContext.cpp:0: DB::ExecutionThreadContext::executeTask() @ 0x000000000f8e3302 8. ./build_gcc/./src/Processors/Executors/PipelineExecutor.cpp:272: DB::PipelineExecutor::executeStepImpl(unsigned long, std::atomic<bool>*) @ 0x000000000f8da0b0 9. ./contrib/llvm-project/libcxx/include/atomic:958: DB::PipelineExecutor::executeStep(std::atomic<bool>*) @ 0x000000000f8d9c49 10. ./build_gcc/./src/Processors/Executors/PullingPipelineExecutor.cpp:54: DB::PullingPipelineExecutor::pull(DB::Chunk&) @ 0x000000000f8e8068 11. ./build_gcc/./src/Processors/Executors/PullingPipelineExecutor.cpp:65: DB::PullingPipelineExecutor::pull(DB::Block&) @ 0x000000000f8e8250 12. ./build_gcc/./utils/extern-local-engine/Parser/SerializedPlanParser.cpp:0: local_engine::LocalExecutor::hasNext() @ 0x000000000b1ecc16 13. ./build_gcc/./utils/extern-local-engine/local_engine_jni.cpp:342: Java_org_apache_gluten_vectorized_BatchIterator_nativeHasNext @ 0x000000000608aa37 at org.apache.gluten.vectorized.BatchIterator.nativeHasNext(Native Method) at org.apache.gluten.vectorized.BatchIterator.hasNextInternal(BatchIterator.java:53) at org.apache.gluten.vectorized.GeneralOutIterator.hasNext(GeneralOutIterator.java:37) ... 16 more 2024-05-21 11:10:57.944 <Information> GraceMergingAggregatedTransform: Metrics. total_input_blocks: 0, total_input_rows: 0, total_output_blocks: 0, total_output_rows: 0, total_spill_disk_bytes: 0, total_spill_disk_time: 0, total_read_disk_time: 0, total_scatter_time: 0 2024-05-21 11:10:57.945 <Debug> MemoryTracker: Peak memory usage (for query): 187.29 MiB. 2024/05/21 11:10:57.945 INFO [Executor task launch worker for task 5.1 in stage 23.0 (TID 2987)] spark.executor.Executor: Executor interrupted and killed task 5.1 in stage 23.0 (TID 2987), reason: Stage finished 2024-05-21 11:11:03.806 <Debug> BroadCastJoinBuilder: Broadcast hash table BuiltHashTable-1854 is cleaned 2024-05-21 11:11:03.808 <Debug> BroadCastJoinBuilder: Broadcast hash table BuiltHashTable-1837 is cleaned 2024-05-21 11:11:03.809 <Debug> BroadCastJoinBuilder: Broadcast hash table BuiltHashTable-1846 is cleaned 2024/05/21 11:11:03.851 INFO [dispatcher-Executor] spark.executor.YarnCoarseGrainedExecutorBackend: Driver commanded a shutdown 2024/05/21 11:11:04.046 INFO [dispatcher-Executor] spark.executor.YarnCoarseGrainedExecutorBackend: Driver from hello-spark2.bigdata.bigo.inner:40109 disconnected during shutdown 2024/05/21 11:11:04.046 INFO [dispatcher-Executor] spark.executor.YarnCoarseGrainedExecutorBackend: Driver from hello-spark2.bigdata.bigo.inner:40109 disconnected during shutdown 2024/05/21 11:11:04.460 INFO [shutdown-hook-0] gluten.vectorized.JniLibLoader: Starting unload library path: /data3/hadoop/yarn/local/usercache/xumingyong/filecache/7668/libch.so 2024/05/21 11:11:04.461 INFO [shutdown-hook-0] gluten.vectorized.JniLibLoader: Finalizing library file: libch.so 2024-05-21 11:11:04.461 <Information> jni: start jni onUnload 2024-05-21 11:11:04.461 <Debug> Context: Destructing remote fs threadpool reader 2024-05-21 11:11:04.461 <Debug> Context: Destructing local fs threadpool reader 2024-05-21 11:11:04.461 <Debug> Context: Destructing local fs threadpool reader 2024/05/21 11:11:04.471 INFO [shutdown-hook-0] spark.util.ShutdownHookManager: Shutdown hook called ``` Task 2 logs ``` 2024/05/21 11:10:57.723 ERROR [Executor task launch worker for task 6.1 in stage 23.0 (TID 2988)] spark.util.TaskResources: Task 2988 failed by error: org.apache.gluten.exception.GlutenException: org.apache.gluten.exception.GlutenException: org.apache.spark.TaskKilledException: While executing SourceFromJavaIter 0. ./contrib/llvm-project/libcxx/include/exception:141: Poco::Exception::Exception(String const&, int) @ 0x0000000012241c39 1. ./build_gcc/./src/Common/Exception.cpp:101: DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x000000000aebb7bb 2. ./contrib/llvm-project/libcxx/include/string:1499: DB::Exception::createRuntime(int, String&) @ 0x00000000060a1b0c 3. ./utils/extern-local-engine/jni/jni_common.h:79: unsigned char local_engine::safeCallBooleanMethod<>(JNIEnv_*, _jobject*, _jmethodID*) @ 0x00000000060a2cab 4. ./build_gcc/./utils/extern-local-engine/Storages/SourceFromJavaIter.cpp:88: local_engine::SourceFromJavaIter::generate() @ 0x000000000b21c922 5. ./src/Processors/Chunk.h:90: DB::ISource::tryGenerate() @ 0x000000000f8cc5f5 6. ./contrib/llvm-project/libcxx/include/optional:344: DB::ISource::work() @ 0x000000000f8cc242 7. ./build_gcc/./src/Processors/Executors/ExecutionThreadContext.cpp:0: DB::ExecutionThreadContext::executeTask() @ 0x000000000f8e3302 8. ./build_gcc/./src/Processors/Executors/PipelineExecutor.cpp:272: DB::PipelineExecutor::executeStepImpl(unsigned long, std::atomic<bool>*) @ 0x000000000f8da0b0 9. ./contrib/llvm-project/libcxx/include/atomic:958: DB::PipelineExecutor::executeStep(std::atomic<bool>*) @ 0x000000000f8d9c49 10. ./build_gcc/./src/Processors/Executors/PullingPipelineExecutor.cpp:54: DB::PullingPipelineExecutor::pull(DB::Chunk&) @ 0x000000000f8e8068 11. ./build_gcc/./src/Processors/Executors/PullingPipelineExecutor.cpp:65: DB::PullingPipelineExecutor::pull(DB::Block&) @ 0x000000000f8e8250 12. ./build_gcc/./utils/extern-local-engine/Parser/SerializedPlanParser.cpp:0: local_engine::LocalExecutor::hasNext() @ 0x000000000b1ecc16 13. ./build_gcc/./utils/extern-local-engine/local_engine_jni.cpp:342: Java_org_apache_gluten_vectorized_BatchIterator_nativeHasNext @ 0x000000000608aa37 at org.apache.gluten.vectorized.GeneralOutIterator.hasNext(GeneralOutIterator.java:39) at org.apache.gluten.backendsapi.clickhouse.CHIteratorApi$$anon$2.hasNext(CHIteratorApi.scala:289) at org.apache.gluten.vectorized.CloseableCHColumnBatchIterator.$anonfun$hasNext$1(CloseableCHColumnBatchIterator.scala:42) at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23) at org.apache.gluten.metrics.GlutenTimeMetric$.withNanoTime(GlutenTimeMetric.scala:41) at org.apache.gluten.vectorized.CloseableCHColumnBatchIterator.hasNext(CloseableCHColumnBatchIterator.scala:42) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:412) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$22(FileFormatWriter.scala:351) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:136) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) Caused by: org.apache.gluten.exception.GlutenException: org.apache.spark.TaskKilledException: While executing SourceFromJavaIter 0. ./contrib/llvm-project/libcxx/include/exception:141: Poco::Exception::Exception(String const&, int) @ 0x0000000012241c39 1. ./build_gcc/./src/Common/Exception.cpp:101: DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x000000000aebb7bb 2. ./contrib/llvm-project/libcxx/include/string:1499: DB::Exception::createRuntime(int, String&) @ 0x00000000060a1b0c 3. ./utils/extern-local-engine/jni/jni_common.h:79: unsigned char local_engine::safeCallBooleanMethod<>(JNIEnv_*, _jobject*, _jmethodID*) @ 0x00000000060a2cab 4. ./build_gcc/./utils/extern-local-engine/Storages/SourceFromJavaIter.cpp:88: local_engine::SourceFromJavaIter::generate() @ 0x000000000b21c922 5. ./src/Processors/Chunk.h:90: DB::ISource::tryGenerate() @ 0x000000000f8cc5f5 6. ./contrib/llvm-project/libcxx/include/optional:344: DB::ISource::work() @ 0x000000000f8cc242 7. ./build_gcc/./src/Processors/Executors/ExecutionThreadContext.cpp:0: DB::ExecutionThreadContext::executeTask() @ 0x000000000f8e3302 8. ./build_gcc/./src/Processors/Executors/PipelineExecutor.cpp:272: DB::PipelineExecutor::executeStepImpl(unsigned long, std::atomic<bool>*) @ 0x000000000f8da0b0 9. ./contrib/llvm-project/libcxx/include/atomic:958: DB::PipelineExecutor::executeStep(std::atomic<bool>*) @ 0x000000000f8d9c49 10. ./build_gcc/./src/Processors/Executors/PullingPipelineExecutor.cpp:54: DB::PullingPipelineExecutor::pull(DB::Chunk&) @ 0x000000000f8e8068 11. ./build_gcc/./src/Processors/Executors/PullingPipelineExecutor.cpp:65: DB::PullingPipelineExecutor::pull(DB::Block&) @ 0x000000000f8e8250 12. ./build_gcc/./utils/extern-local-engine/Parser/SerializedPlanParser.cpp:0: local_engine::LocalExecutor::hasNext() @ 0x000000000b1ecc16 13. ./build_gcc/./utils/extern-local-engine/local_engine_jni.cpp:342: Java_org_apache_gluten_vectorized_BatchIterator_nativeHasNext @ 0x000000000608aa37 at org.apache.gluten.vectorized.BatchIterator.nativeHasNext(Native Method) at org.apache.gluten.vectorized.BatchIterator.hasNextInternal(BatchIterator.java:53) at org.apache.gluten.vectorized.GeneralOutIterator.hasNext(GeneralOutIterator.java:37) ... 16 more 2024-05-21 11:10:57.727 <Information> GraceMergingAggregatedTransform: Metrics. total_input_blocks: 82, total_input_rows: 721055, total_output_blocks: 0, total_output_rows: 0, total_spill_disk_bytes: 0, total_spill_disk_time: 0, total_read_disk_time: 0, total_scatter_time: 0 2024-05-21 11:10:57.730 <Debug> MemoryTracker: Peak memory usage (for query): 261.79 MiB. 2024/05/21 11:10:57.731 INFO [Executor task launch worker for task 6.1 in stage 23.0 (TID 2988)] spark.executor.Executor: Executor interrupted and killed task 6.1 in stage 23.0 (TID 2988), reason: another attempt succeeded 2024-05-21 11:11:03.804 <Debug> BroadCastJoinBuilder: Broadcast hash table BuiltHashTable-1854 is cleaned 2024-05-21 11:11:03.804 <Debug> BroadCastJoinBuilder: Broadcast hash table BuiltHashTable-1846 is cleaned 2024-05-21 11:11:03.805 <Debug> BroadCastJoinBuilder: Broadcast hash table BuiltHashTable-1837 is cleaned 2024/05/21 11:11:03.850 INFO [dispatcher-Executor] spark.executor.YarnCoarseGrainedExecutorBackend: Driver commanded a shutdown 2024/05/21 11:11:04.046 INFO [dispatcher-Executor] spark.executor.YarnCoarseGrainedExecutorBackend: Driver from hello-spark2.bigdata.bigo.inner:40109 disconnected during shutdown 2024/05/21 11:11:04.046 INFO [dispatcher-Executor] spark.executor.YarnCoarseGrainedExecutorBackend: Driver from hello-spark2.bigdata.bigo.inner:40109 disconnected during shutdown 2024/05/21 11:11:04.395 INFO [shutdown-hook-0] gluten.vectorized.JniLibLoader: Starting unload library path: /data10/hadoop/yarn/local/usercache/xumingyong/filecache/7824/libch.so 2024/05/21 11:11:04.396 INFO [shutdown-hook-0] gluten.vectorized.JniLibLoader: Finalizing library file: libch.so 2024-05-21 11:11:04.396 <Information> jni: start jni onUnload 2024-05-21 11:11:04.396 <Debug> Context: Destructing remote fs threadpool reader 2024-05-21 11:11:04.396 <Debug> Context: Destructing local fs threadpool reader 2024-05-21 11:11:04.396 <Debug> Context: Destructing local fs threadpool reader 2024/05/21 11:11:04.404 INFO [shutdown-hook-0] spark.util.ShutdownHookManager: Shutdown hook called ``` ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
