zhouyifan279 opened a new issue, #5136: URL: https://github.com/apache/kyuubi/issues/5136
### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)

### Search before asking

- [X] I have searched in the [issues](https://github.com/apache/kyuubi/issues?q=is%3Aissue) and found no similar issues.

### Describe the bug

We found a Spark application hung at the final stage:

<img width="2076" alt="image" src="https://github.com/apache/kyuubi/assets/88070094/70ded93a-abfd-4b8d-9b2b-5101c6297ec9">

Rerunning the application produced the same result.

### Affects Version(s)

master

### Kyuubi Server Log Output

_No response_

### Kyuubi Engine Log Output

```log
2023-08-02 19:54:59 CST DAGScheduler INFO - ShuffleMapStage 4 (sql at SparkSQLExecute.java:17) finished in 279.363 s
2023-08-02 19:54:59 CST YarnClusterScheduler INFO - Removed TaskSet 4.0, whose tasks have all completed, from pool default
2023-08-02 19:54:59 CST DAGScheduler INFO - looking for newly runnable stages
2023-08-02 19:54:59 CST DAGScheduler INFO - running: Set()
2023-08-02 19:54:59 CST DAGScheduler INFO - waiting: Set()
2023-08-02 19:54:59 CST DAGScheduler INFO - failed: Set()
2023-08-02 19:54:59 CST YarnAllocator INFO - Resource profile 0 doesn't exist, adding it
2023-08-02 19:54:59 CST YarnAllocator INFO - Driver requested a total number of 1 executor(s) for resource profile id: 0.
2023-08-02 19:54:59 CST YarnClusterSchedulerBackend INFO - Requesting to kill executor(s) 99, 90, 84, 57, 63, 39, 30, 45, 66, 2, 72, 5, 48, 33, 69, 27, 54, 60, 15, 42, 21, 71, 92, 86, 24, 74, 89, 95, 53, 41, 83, 56, 17, 1, 44, 50, 23, 38, 4, 26, 11, 32, 82, 97, 29, 20, 85, 79, 70, 64, 91, 46, 94, 73, 67, 88, 34, 28, 6, 40, 55, 76, 49, 61, 43, 9, 22, 58, 3, 10, 25, 93, 81, 75, 13
2023-08-02 19:54:59 CST YarnClusterSchedulerBackend INFO - Actual list of executor(s) to be killed is 99, 90, 84, 57, 63, 39, 30, 45, 66, 2, 72, 5, 48, 33, 69, 27, 54, 60, 15, 42, 21, 71, 92, 86, 24, 74, 89, 95, 53, 41, 83, 56, 17, 1, 44, 50, 23, 38, 4, 26, 11, 32, 82, 97, 29, 20, 85, 79, 70, 64, 91, 46, 94, 73, 67, 88, 34, 28, 6, 40, 55, 76, 49, 61, 43, 9, 22, 58, 3, 10, 25, 93, 81, 75, 13
2023-08-02 19:54:59 CST ApplicationMaster$AMEndpoint INFO - Driver requested to kill executor(s) 99, 90, 84, 57, 63, 39, 30, 45, 66, 2, 72, 5, 48, 33, 69, 27, 54, 60, 15, 42, 21, 71, 92, 86, 24, 74, 89, 95, 53, 41, 83, 56, 17, 1, 44, 50, 23, 38, 4, 26, 11, 32, 82, 97, 29, 20, 85, 79, 70, 64, 91, 46, 94, 73, 67, 88, 34, 28, 6, 40, 55, 76, 49, 61, 43, 9, 22, 58, 3, 10, 25, 93, 81, 75, 13.
2023-08-02 19:54:59 CST YarnAllocator INFO - Resource profile 0 doesn't exist, adding it
2023-08-02 19:54:59 CST ExecutorAllocationManager INFO - Executors 99,90,84,57,63,39,30,45,66,2,72,5,48,33,69,27,54,60,15,42,21,71,92,86,24,74,89,95,53,41,83,56,17,1,44,50,23,38,4,26,11,32,82,97,29,20,85,79,70,64,91,46,94,73,67,88,34,28,6,40,55,76,49,61,43,9,22,58,3,10,25,93,81,75,13 removed due to idle timeout.
2023-08-02 19:55:00 CST YarnClusterSchedulerBackend INFO - Requesting to kill executor(s) 65
2023-08-02 19:55:00 CST YarnClusterSchedulerBackend INFO - Actual list of executor(s) to be killed is 65
2023-08-02 19:55:00 CST ApplicationMaster$AMEndpoint INFO - Driver requested to kill executor(s) 65.
2023-08-02 19:55:00 CST YarnAllocator INFO - Resource profile 0 doesn't exist, adding it
2023-08-02 19:55:00 CST ExecutorAllocationManager INFO - Executors 65 removed due to idle timeout.
2023-08-02 19:55:00 CST FinalStageConfigIsolation INFO - Store config: spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes to previousStage, original value: 128M
2023-08-02 19:55:00 CST FinalStageConfigIsolation INFO - For final stage: set spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes = 256M.
2023-08-02 19:55:00 CST FinalStageConfigIsolation INFO - Store config: spark.sql.adaptive.skewJoin.skewedPartitionFactor to previousStage, original value: 4
2023-08-02 19:55:00 CST FinalStageConfigIsolation INFO - For final stage: set spark.sql.adaptive.skewJoin.skewedPartitionFactor = 5.
2023-08-02 19:55:00 CST FinalStageConfigIsolation INFO - Store config: spark.sql.adaptive.advisoryPartitionSizeInBytes to previousStage, original value: 8MB
2023-08-02 19:55:00 CST FinalStageConfigIsolation INFO - For final stage: set spark.sql.adaptive.advisoryPartitionSizeInBytes = 384MB.
2023-08-02 19:55:00 CST FinalStageConfigIsolation INFO - Store config: spark.sql.adaptive.coalescePartitions.minPartitionNum to previousStage, original value: __INTERNAL_UNSET_CONFIG_TAG__
2023-08-02 19:55:00 CST FinalStageConfigIsolation INFO - For final stage: set spark.sql.adaptive.coalescePartitions.minPartitionNum = 1.
2023-08-02 19:55:00 CST ShufflePartitionsUtil INFO - For shuffle(2), advisory target size: 402653184, actual target size 21915537.
2023-08-02 19:55:00 CST FinalStageResourceManager INFO - The snapshot of current executors view, active executors: 100, min executor: 1, target executors: 1, has benefits: true
2023-08-02 19:55:01 CST YarnClusterSchedulerBackend INFO - Requesting to kill executor(s) 51, 19
2023-08-02 19:55:01 CST YarnClusterSchedulerBackend INFO - Actual list of executor(s) to be killed is 51, 19
2023-08-02 19:55:01 CST ApplicationMaster$AMEndpoint INFO - Driver requested to kill executor(s) 51, 19.
2023-08-02 19:55:01 CST YarnAllocator INFO - Resource profile 0 doesn't exist, adding it
2023-08-02 19:55:01 CST ExecutorAllocationManager INFO - Executors 51,19 removed due to idle timeout.
2023-08-02 19:55:02 CST FinalStageResourceManager INFO - Request to kill executors, total count 99, [88, 42, 77, 79, 2, 75, 81, 6, 15, 90, 28, 43, 63, 64, 14, 93, 70, 21, 56, 34, 10, 33, 11, 65, 61, 57, 35, 18, 3, 7, 20, 17, 32, 30, 68, 29, 86, 24, 47, 52, 38, 54, 41, 8, 9, 60, 40, 74, 4, 82, 100, 72, 45, 69, 36, 12, 46, 58, 95, 80, 44, 87, 55, 53, 5, 23, 26, 22, 97, 85, 96, 66, 59, 16, 84, 37, 48, 50, 51, 67, 39, 78, 62, 49, 71, 25, 13, 83, 89, 73, 31, 91, 19, 1, 99, 92, 94, 98, 27].
2023-08-02 19:55:02 CST YarnClusterSchedulerBackend INFO - Requesting to kill executor(s) 88, 42, 77, 79, 2, 75, 81, 6, 15, 90, 28, 43, 63, 64, 14, 93, 70, 21, 56, 34, 10, 33, 11, 65, 61, 57, 35, 18, 3, 7, 20, 17, 32, 30, 68, 29, 86, 24, 47, 52, 38, 54, 41, 8, 9, 60, 40, 74, 4, 82, 100, 72, 45, 69, 36, 12, 46, 58, 95, 80, 44, 87, 55, 53, 5, 23, 26, 22, 97, 85, 96, 66, 59, 16, 84, 37, 48, 50, 51, 67, 39, 78, 62, 49, 71, 25, 13, 83, 89, 73, 31, 91, 19, 1, 99, 92, 94, 98, 27
2023-08-02 19:55:02 CST YarnClusterSchedulerBackend INFO - Actual list of executor(s) to be killed is 77, 14, 35, 18, 7, 68, 47, 52, 8, 100, 36, 12, 80, 87, 96, 59, 16, 37, 78, 62, 31, 98
2023-08-02 19:55:02 CST YarnAllocator INFO - Resource profile 0 doesn't exist, adding it
2023-08-02 19:55:02 CST YarnAllocator INFO - Driver requested a total number of 0 executor(s) for resource profile id: 0.
2023-08-02 19:55:02 CST ApplicationMaster$AMEndpoint INFO - Driver requested to kill executor(s) 77, 14, 35, 18, 7, 68, 47, 52, 8, 100, 36, 12, 80, 87, 96, 59, 16, 37, 78, 62, 31, 98.
```

### Kyuubi Server Configurations

_No response_

### Kyuubi Engine Configurations

_No response_

### Additional context

Spark DRA was enabled and spark.dynamicAllocation.minExecutors was set to 1.

### Are you willing to submit PR?

- [X] Yes. I would be willing to submit a PR with guidance from the Kyuubi community to fix.
- [X] No. I cannot submit a PR at this time.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
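For context, the dynamic-allocation setup described under "Additional context" corresponds to a `spark-defaults.conf` fragment roughly like the sketch below. Only `spark.dynamicAllocation.enabled` and `spark.dynamicAllocation.minExecutors` come from this report; the shuffle-tracking line is an assumption added for completeness, since DRA needs either shuffle tracking or an external shuffle service to be able to release executors.

```properties
# From this report: DRA on, with a floor of one executor
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.minExecutors=1
# Assumption (not stated in the report): one common way to satisfy
# DRA's shuffle requirement without an external shuffle service
spark.dynamicAllocation.shuffleTracking.enabled=true
```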
