taiyang-li opened a new issue, #5823:
URL: https://github.com/apache/incubator-gluten/issues/5823

   ### Backend
   
   CH (ClickHouse)
   
   ### Bug description
   
   When spark speculation execution is enabled, some tasks maybe killed by 
driver because another attempt had finished successfully.
   Those tasks failed because they are not exiting gracefully.
   
   
![image](https://github.com/apache/incubator-gluten/assets/8181003/4cfb597d-135b-438e-9c05-785ffbced130)
    
   
   ### Spark version
   
   Spark-3.3.x
   
   ### Spark configurations
   
   ``` 
   --conf spark.speculation=true
       --conf spark.speculation.multiplier=1.1
       --conf spark.speculation.quantile=0.5
   ```
   
   ### System information
   
   _No response_
   
   ### Relevant logs
   
   ```bash
   Task 1 logs
   
   2024/05/21 11:10:57.941 ERROR [Executor task launch worker for task 5.1 in 
stage 23.0 (TID 2987)] spark.util.TaskResources: Task 2987 failed by error: 
   org.apache.gluten.exception.GlutenException: 
org.apache.gluten.exception.GlutenException: 
org.apache.spark.TaskKilledException: While executing SourceFromJavaIter
   0. ./contrib/llvm-project/libcxx/include/exception:141: 
Poco::Exception::Exception(String const&, int) @ 0x0000000012241c39
   1. ./build_gcc/./src/Common/Exception.cpp:101: 
DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 
0x000000000aebb7bb
   2. ./contrib/llvm-project/libcxx/include/string:1499: 
DB::Exception::createRuntime(int, String&) @ 0x00000000060a1b0c
   3. ./utils/extern-local-engine/jni/jni_common.h:79: unsigned char 
local_engine::safeCallBooleanMethod<>(JNIEnv_*, _jobject*, _jmethodID*) @ 
0x00000000060a2cab
   4. 
./build_gcc/./utils/extern-local-engine/Storages/SourceFromJavaIter.cpp:88: 
local_engine::SourceFromJavaIter::generate() @ 0x000000000b21c922
   5. ./src/Processors/Chunk.h:90: DB::ISource::tryGenerate() @ 
0x000000000f8cc5f5
   6. ./contrib/llvm-project/libcxx/include/optional:344: DB::ISource::work() @ 
0x000000000f8cc242
   7. ./build_gcc/./src/Processors/Executors/ExecutionThreadContext.cpp:0: 
DB::ExecutionThreadContext::executeTask() @ 0x000000000f8e3302
   8. ./build_gcc/./src/Processors/Executors/PipelineExecutor.cpp:272: 
DB::PipelineExecutor::executeStepImpl(unsigned long, std::atomic<bool>*) @ 
0x000000000f8da0b0
   9. ./contrib/llvm-project/libcxx/include/atomic:958: 
DB::PipelineExecutor::executeStep(std::atomic<bool>*) @ 0x000000000f8d9c49
   10. ./build_gcc/./src/Processors/Executors/PullingPipelineExecutor.cpp:54: 
DB::PullingPipelineExecutor::pull(DB::Chunk&) @ 0x000000000f8e8068
   11. ./build_gcc/./src/Processors/Executors/PullingPipelineExecutor.cpp:65: 
DB::PullingPipelineExecutor::pull(DB::Block&) @ 0x000000000f8e8250
   12. 
./build_gcc/./utils/extern-local-engine/Parser/SerializedPlanParser.cpp:0: 
local_engine::LocalExecutor::hasNext() @ 0x000000000b1ecc16
   13. ./build_gcc/./utils/extern-local-engine/local_engine_jni.cpp:342: 
Java_org_apache_gluten_vectorized_BatchIterator_nativeHasNext @ 
0x000000000608aa37
   
        at 
org.apache.gluten.vectorized.GeneralOutIterator.hasNext(GeneralOutIterator.java:39)
        at 
org.apache.gluten.backendsapi.clickhouse.CHIteratorApi$$anon$2.hasNext(CHIteratorApi.scala:289)
        at 
org.apache.gluten.vectorized.CloseableCHColumnBatchIterator.$anonfun$hasNext$1(CloseableCHColumnBatchIterator.scala:42)
        at 
scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
        at 
org.apache.gluten.metrics.GlutenTimeMetric$.withNanoTime(GlutenTimeMetric.scala:41)
        at 
org.apache.gluten.vectorized.CloseableCHColumnBatchIterator.hasNext(CloseableCHColumnBatchIterator.scala:42)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
        at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:412)
        at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$22(FileFormatWriter.scala:351)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:136)
        at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
   Caused by: org.apache.gluten.exception.GlutenException: 
org.apache.spark.TaskKilledException: While executing SourceFromJavaIter
   0. ./contrib/llvm-project/libcxx/include/exception:141: 
Poco::Exception::Exception(String const&, int) @ 0x0000000012241c39
   1. ./build_gcc/./src/Common/Exception.cpp:101: 
DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 
0x000000000aebb7bb
   2. ./contrib/llvm-project/libcxx/include/string:1499: 
DB::Exception::createRuntime(int, String&) @ 0x00000000060a1b0c
   3. ./utils/extern-local-engine/jni/jni_common.h:79: unsigned char 
local_engine::safeCallBooleanMethod<>(JNIEnv_*, _jobject*, _jmethodID*) @ 
0x00000000060a2cab
   4. 
./build_gcc/./utils/extern-local-engine/Storages/SourceFromJavaIter.cpp:88: 
local_engine::SourceFromJavaIter::generate() @ 0x000000000b21c922
   5. ./src/Processors/Chunk.h:90: DB::ISource::tryGenerate() @ 
0x000000000f8cc5f5
   6. ./contrib/llvm-project/libcxx/include/optional:344: DB::ISource::work() @ 
0x000000000f8cc242
   7. ./build_gcc/./src/Processors/Executors/ExecutionThreadContext.cpp:0: 
DB::ExecutionThreadContext::executeTask() @ 0x000000000f8e3302
   8. ./build_gcc/./src/Processors/Executors/PipelineExecutor.cpp:272: 
DB::PipelineExecutor::executeStepImpl(unsigned long, std::atomic<bool>*) @ 
0x000000000f8da0b0
   9. ./contrib/llvm-project/libcxx/include/atomic:958: 
DB::PipelineExecutor::executeStep(std::atomic<bool>*) @ 0x000000000f8d9c49
   10. ./build_gcc/./src/Processors/Executors/PullingPipelineExecutor.cpp:54: 
DB::PullingPipelineExecutor::pull(DB::Chunk&) @ 0x000000000f8e8068
   11. ./build_gcc/./src/Processors/Executors/PullingPipelineExecutor.cpp:65: 
DB::PullingPipelineExecutor::pull(DB::Block&) @ 0x000000000f8e8250
   12. 
./build_gcc/./utils/extern-local-engine/Parser/SerializedPlanParser.cpp:0: 
local_engine::LocalExecutor::hasNext() @ 0x000000000b1ecc16
   13. ./build_gcc/./utils/extern-local-engine/local_engine_jni.cpp:342: 
Java_org_apache_gluten_vectorized_BatchIterator_nativeHasNext @ 
0x000000000608aa37
   
        at org.apache.gluten.vectorized.BatchIterator.nativeHasNext(Native 
Method)
        at 
org.apache.gluten.vectorized.BatchIterator.hasNextInternal(BatchIterator.java:53)
        at 
org.apache.gluten.vectorized.GeneralOutIterator.hasNext(GeneralOutIterator.java:37)
        ... 16 more
   2024-05-21 11:10:57.944 <Information> GraceMergingAggregatedTransform: 
Metrics. total_input_blocks: 0, total_input_rows: 0, total_output_blocks: 0, 
total_output_rows: 0, total_spill_disk_bytes: 0, total_spill_disk_time: 0, 
total_read_disk_time: 0, total_scatter_time: 0
   2024-05-21 11:10:57.945 <Debug> MemoryTracker: Peak memory usage (for 
query): 187.29 MiB.
   2024/05/21 11:10:57.945 INFO [Executor task launch worker for task 5.1 in 
stage 23.0 (TID 2987)] spark.executor.Executor: Executor interrupted and killed 
task 5.1 in stage 23.0 (TID 2987), reason: Stage finished
   2024-05-21 11:11:03.806 <Debug> BroadCastJoinBuilder: Broadcast hash table 
BuiltHashTable-1854 is cleaned
   2024-05-21 11:11:03.808 <Debug> BroadCastJoinBuilder: Broadcast hash table 
BuiltHashTable-1837 is cleaned
   2024-05-21 11:11:03.809 <Debug> BroadCastJoinBuilder: Broadcast hash table 
BuiltHashTable-1846 is cleaned
   2024/05/21 11:11:03.851 INFO [dispatcher-Executor] 
spark.executor.YarnCoarseGrainedExecutorBackend: Driver commanded a shutdown
   2024/05/21 11:11:04.046 INFO [dispatcher-Executor] 
spark.executor.YarnCoarseGrainedExecutorBackend: Driver from 
hello-spark2.bigdata.bigo.inner:40109 disconnected during shutdown
   2024/05/21 11:11:04.046 INFO [dispatcher-Executor] 
spark.executor.YarnCoarseGrainedExecutorBackend: Driver from 
hello-spark2.bigdata.bigo.inner:40109 disconnected during shutdown
   2024/05/21 11:11:04.460 INFO [shutdown-hook-0] 
gluten.vectorized.JniLibLoader: Starting unload library path: 
/data3/hadoop/yarn/local/usercache/xumingyong/filecache/7668/libch.so 
   2024/05/21 11:11:04.461 INFO [shutdown-hook-0] 
gluten.vectorized.JniLibLoader: Finalizing library file: libch.so
   2024-05-21 11:11:04.461 <Information> jni: start jni onUnload
   2024-05-21 11:11:04.461 <Debug> Context: Destructing remote fs threadpool 
reader
   2024-05-21 11:11:04.461 <Debug> Context: Destructing local fs threadpool 
reader
   2024-05-21 11:11:04.461 <Debug> Context: Destructing local fs threadpool 
reader
   2024/05/21 11:11:04.471 INFO [shutdown-hook-0] 
spark.util.ShutdownHookManager: Shutdown hook called
   ```
   
   
   Task 2 logs
   ```
   2024/05/21 11:10:57.723 ERROR [Executor task launch worker for task 6.1 in 
stage 23.0 (TID 2988)] spark.util.TaskResources: Task 2988 failed by error: 
   org.apache.gluten.exception.GlutenException: 
org.apache.gluten.exception.GlutenException: 
org.apache.spark.TaskKilledException: While executing SourceFromJavaIter
   0. ./contrib/llvm-project/libcxx/include/exception:141: 
Poco::Exception::Exception(String const&, int) @ 0x0000000012241c39
   1. ./build_gcc/./src/Common/Exception.cpp:101: 
DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 
0x000000000aebb7bb
   2. ./contrib/llvm-project/libcxx/include/string:1499: 
DB::Exception::createRuntime(int, String&) @ 0x00000000060a1b0c
   3. ./utils/extern-local-engine/jni/jni_common.h:79: unsigned char 
local_engine::safeCallBooleanMethod<>(JNIEnv_*, _jobject*, _jmethodID*) @ 
0x00000000060a2cab
   4. 
./build_gcc/./utils/extern-local-engine/Storages/SourceFromJavaIter.cpp:88: 
local_engine::SourceFromJavaIter::generate() @ 0x000000000b21c922
   5. ./src/Processors/Chunk.h:90: DB::ISource::tryGenerate() @ 
0x000000000f8cc5f5
   6. ./contrib/llvm-project/libcxx/include/optional:344: DB::ISource::work() @ 
0x000000000f8cc242
   7. ./build_gcc/./src/Processors/Executors/ExecutionThreadContext.cpp:0: 
DB::ExecutionThreadContext::executeTask() @ 0x000000000f8e3302
   8. ./build_gcc/./src/Processors/Executors/PipelineExecutor.cpp:272: 
DB::PipelineExecutor::executeStepImpl(unsigned long, std::atomic<bool>*) @ 
0x000000000f8da0b0
   9. ./contrib/llvm-project/libcxx/include/atomic:958: 
DB::PipelineExecutor::executeStep(std::atomic<bool>*) @ 0x000000000f8d9c49
   10. ./build_gcc/./src/Processors/Executors/PullingPipelineExecutor.cpp:54: 
DB::PullingPipelineExecutor::pull(DB::Chunk&) @ 0x000000000f8e8068
   11. ./build_gcc/./src/Processors/Executors/PullingPipelineExecutor.cpp:65: 
DB::PullingPipelineExecutor::pull(DB::Block&) @ 0x000000000f8e8250
   12. 
./build_gcc/./utils/extern-local-engine/Parser/SerializedPlanParser.cpp:0: 
local_engine::LocalExecutor::hasNext() @ 0x000000000b1ecc16
   13. ./build_gcc/./utils/extern-local-engine/local_engine_jni.cpp:342: 
Java_org_apache_gluten_vectorized_BatchIterator_nativeHasNext @ 
0x000000000608aa37
   
        at 
org.apache.gluten.vectorized.GeneralOutIterator.hasNext(GeneralOutIterator.java:39)
        at 
org.apache.gluten.backendsapi.clickhouse.CHIteratorApi$$anon$2.hasNext(CHIteratorApi.scala:289)
        at 
org.apache.gluten.vectorized.CloseableCHColumnBatchIterator.$anonfun$hasNext$1(CloseableCHColumnBatchIterator.scala:42)
        at 
scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
        at 
org.apache.gluten.metrics.GlutenTimeMetric$.withNanoTime(GlutenTimeMetric.scala:41)
        at 
org.apache.gluten.vectorized.CloseableCHColumnBatchIterator.hasNext(CloseableCHColumnBatchIterator.scala:42)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
        at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:412)
        at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$22(FileFormatWriter.scala:351)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:136)
        at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
   Caused by: org.apache.gluten.exception.GlutenException: 
org.apache.spark.TaskKilledException: While executing SourceFromJavaIter
   0. ./contrib/llvm-project/libcxx/include/exception:141: 
Poco::Exception::Exception(String const&, int) @ 0x0000000012241c39
   1. ./build_gcc/./src/Common/Exception.cpp:101: 
DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 
0x000000000aebb7bb
   2. ./contrib/llvm-project/libcxx/include/string:1499: 
DB::Exception::createRuntime(int, String&) @ 0x00000000060a1b0c
   3. ./utils/extern-local-engine/jni/jni_common.h:79: unsigned char 
local_engine::safeCallBooleanMethod<>(JNIEnv_*, _jobject*, _jmethodID*) @ 
0x00000000060a2cab
   4. 
./build_gcc/./utils/extern-local-engine/Storages/SourceFromJavaIter.cpp:88: 
local_engine::SourceFromJavaIter::generate() @ 0x000000000b21c922
   5. ./src/Processors/Chunk.h:90: DB::ISource::tryGenerate() @ 
0x000000000f8cc5f5
   6. ./contrib/llvm-project/libcxx/include/optional:344: DB::ISource::work() @ 
0x000000000f8cc242
   7. ./build_gcc/./src/Processors/Executors/ExecutionThreadContext.cpp:0: 
DB::ExecutionThreadContext::executeTask() @ 0x000000000f8e3302
   8. ./build_gcc/./src/Processors/Executors/PipelineExecutor.cpp:272: 
DB::PipelineExecutor::executeStepImpl(unsigned long, std::atomic<bool>*) @ 
0x000000000f8da0b0
   9. ./contrib/llvm-project/libcxx/include/atomic:958: 
DB::PipelineExecutor::executeStep(std::atomic<bool>*) @ 0x000000000f8d9c49
   10. ./build_gcc/./src/Processors/Executors/PullingPipelineExecutor.cpp:54: 
DB::PullingPipelineExecutor::pull(DB::Chunk&) @ 0x000000000f8e8068
   11. ./build_gcc/./src/Processors/Executors/PullingPipelineExecutor.cpp:65: 
DB::PullingPipelineExecutor::pull(DB::Block&) @ 0x000000000f8e8250
   12. 
./build_gcc/./utils/extern-local-engine/Parser/SerializedPlanParser.cpp:0: 
local_engine::LocalExecutor::hasNext() @ 0x000000000b1ecc16
   13. ./build_gcc/./utils/extern-local-engine/local_engine_jni.cpp:342: 
Java_org_apache_gluten_vectorized_BatchIterator_nativeHasNext @ 
0x000000000608aa37
   
        at org.apache.gluten.vectorized.BatchIterator.nativeHasNext(Native 
Method)
        at 
org.apache.gluten.vectorized.BatchIterator.hasNextInternal(BatchIterator.java:53)
        at 
org.apache.gluten.vectorized.GeneralOutIterator.hasNext(GeneralOutIterator.java:37)
        ... 16 more
   2024-05-21 11:10:57.727 <Information> GraceMergingAggregatedTransform: 
Metrics. total_input_blocks: 82, total_input_rows: 721055, total_output_blocks: 
0, total_output_rows: 0, total_spill_disk_bytes: 0, total_spill_disk_time: 0, 
total_read_disk_time: 0, total_scatter_time: 0
   2024-05-21 11:10:57.730 <Debug> MemoryTracker: Peak memory usage (for 
query): 261.79 MiB.
   2024/05/21 11:10:57.731 INFO [Executor task launch worker for task 6.1 in 
stage 23.0 (TID 2988)] spark.executor.Executor: Executor interrupted and killed 
task 6.1 in stage 23.0 (TID 2988), reason: another attempt succeeded
   2024-05-21 11:11:03.804 <Debug> BroadCastJoinBuilder: Broadcast hash table 
BuiltHashTable-1854 is cleaned
   2024-05-21 11:11:03.804 <Debug> BroadCastJoinBuilder: Broadcast hash table 
BuiltHashTable-1846 is cleaned
   2024-05-21 11:11:03.805 <Debug> BroadCastJoinBuilder: Broadcast hash table 
BuiltHashTable-1837 is cleaned
   2024/05/21 11:11:03.850 INFO [dispatcher-Executor] 
spark.executor.YarnCoarseGrainedExecutorBackend: Driver commanded a shutdown
   2024/05/21 11:11:04.046 INFO [dispatcher-Executor] 
spark.executor.YarnCoarseGrainedExecutorBackend: Driver from 
hello-spark2.bigdata.bigo.inner:40109 disconnected during shutdown
   2024/05/21 11:11:04.046 INFO [dispatcher-Executor] 
spark.executor.YarnCoarseGrainedExecutorBackend: Driver from 
hello-spark2.bigdata.bigo.inner:40109 disconnected during shutdown
   2024/05/21 11:11:04.395 INFO [shutdown-hook-0] 
gluten.vectorized.JniLibLoader: Starting unload library path: 
/data10/hadoop/yarn/local/usercache/xumingyong/filecache/7824/libch.so 
   2024/05/21 11:11:04.396 INFO [shutdown-hook-0] 
gluten.vectorized.JniLibLoader: Finalizing library file: libch.so
   2024-05-21 11:11:04.396 <Information> jni: start jni onUnload
   2024-05-21 11:11:04.396 <Debug> Context: Destructing remote fs threadpool 
reader
   2024-05-21 11:11:04.396 <Debug> Context: Destructing local fs threadpool 
reader
   2024-05-21 11:11:04.396 <Debug> Context: Destructing local fs threadpool 
reader
   2024/05/21 11:11:04.404 INFO [shutdown-hook-0] 
spark.util.ShutdownHookManager: Shutdown hook called
   ```
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to