zhouyifan279 opened a new issue, #5136: URL: https://github.com/apache/kyuubi/issues/5136
### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)

### Search before asking

- [X] I have searched in the [issues](https://github.com/apache/kyuubi/issues?q=is%3Aissue) and found no similar issues.

### Describe the bug

We found a Spark application hung at the final stage:

<img width="2076" alt="image" src="https://github.com/apache/kyuubi/assets/88070094/70ded93a-abfd-4b8d-9b2b-5101c6297ec9">

Rerunning the application produced the same result.

### Affects Version(s)

master

### Kyuubi Server Log Output

_No response_

### Kyuubi Engine Log Output

```log
2023-08-02 19:54:59 CST DAGScheduler INFO - ShuffleMapStage 4 (sql at SparkSQLExecute.java:17) finished in 279.363 s
2023-08-02 19:54:59 CST YarnClusterScheduler INFO - Removed TaskSet 4.0, whose tasks have all completed, from pool default
2023-08-02 19:54:59 CST DAGScheduler INFO - looking for newly runnable stages
2023-08-02 19:54:59 CST DAGScheduler INFO - running: Set()
2023-08-02 19:54:59 CST DAGScheduler INFO - waiting: Set()
2023-08-02 19:54:59 CST DAGScheduler INFO - failed: Set()
2023-08-02 19:54:59 CST YarnAllocator INFO - Resource profile 0 doesn't exist, adding it
2023-08-02 19:54:59 CST YarnAllocator INFO - Driver requested a total number of 1 executor(s) for resource profile id: 0.
2023-08-02 19:54:59 CST YarnClusterSchedulerBackend INFO - Requesting to kill executor(s) 99, 90, 84, 57, 63, 39, 30, 45, 66, 2, 72, 5, 48, 33, 69, 27, 54, 60, 15, 42, 21, 71, 92, 86, 24, 74, 89, 95, 53, 41, 83, 56, 17, 1, 44, 50, 23, 38, 4, 26, 11, 32, 82, 97, 29, 20, 85, 79, 70, 64, 91, 46, 94, 73, 67, 88, 34, 28, 6, 40, 55, 76, 49, 61, 43, 9, 22, 58, 3, 10, 25, 93, 81, 75, 13
2023-08-02 19:54:59 CST YarnClusterSchedulerBackend INFO - Actual list of executor(s) to be killed is 99, 90, 84, 57, 63, 39, 30, 45, 66, 2, 72, 5, 48, 33, 69, 27, 54, 60, 15, 42, 21, 71, 92, 86, 24, 74, 89, 95, 53, 41, 83, 56, 17, 1, 44, 50, 23, 38, 4, 26, 11, 32, 82, 97, 29, 20, 85, 79, 70, 64, 91, 46, 94, 73, 67, 88, 34, 28, 6, 40, 55, 76, 49, 61, 43, 9, 22, 58, 3, 10, 25, 93, 81, 75, 13
2023-08-02 19:54:59 CST ApplicationMaster$AMEndpoint INFO - Driver requested to kill executor(s) 99, 90, 84, 57, 63, 39, 30, 45, 66, 2, 72, 5, 48, 33, 69, 27, 54, 60, 15, 42, 21, 71, 92, 86, 24, 74, 89, 95, 53, 41, 83, 56, 17, 1, 44, 50, 23, 38, 4, 26, 11, 32, 82, 97, 29, 20, 85, 79, 70, 64, 91, 46, 94, 73, 67, 88, 34, 28, 6, 40, 55, 76, 49, 61, 43, 9, 22, 58, 3, 10, 25, 93, 81, 75, 13.
2023-08-02 19:54:59 CST YarnAllocator INFO - Resource profile 0 doesn't exist, adding it
2023-08-02 19:54:59 CST ExecutorAllocationManager INFO - Executors 99,90,84,57,63,39,30,45,66,2,72,5,48,33,69,27,54,60,15,42,21,71,92,86,24,74,89,95,53,41,83,56,17,1,44,50,23,38,4,26,11,32,82,97,29,20,85,79,70,64,91,46,94,73,67,88,34,28,6,40,55,76,49,61,43,9,22,58,3,10,25,93,81,75,13 removed due to idle timeout.
2023-08-02 19:55:00 CST YarnClusterSchedulerBackend INFO - Requesting to kill executor(s) 65
2023-08-02 19:55:00 CST YarnClusterSchedulerBackend INFO - Actual list of executor(s) to be killed is 65
2023-08-02 19:55:00 CST ApplicationMaster$AMEndpoint INFO - Driver requested to kill executor(s) 65.
2023-08-02 19:55:00 CST YarnAllocator INFO - Resource profile 0 doesn't exist, adding it
2023-08-02 19:55:00 CST ExecutorAllocationManager INFO - Executors 65 removed due to idle timeout.
2023-08-02 19:55:00 CST FinalStageConfigIsolation INFO - Store config: spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes to previousStage, original value: 128M
2023-08-02 19:55:00 CST FinalStageConfigIsolation INFO - For final stage: set spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes = 256M.
2023-08-02 19:55:00 CST FinalStageConfigIsolation INFO - Store config: spark.sql.adaptive.skewJoin.skewedPartitionFactor to previousStage, original value: 4
2023-08-02 19:55:00 CST FinalStageConfigIsolation INFO - For final stage: set spark.sql.adaptive.skewJoin.skewedPartitionFactor = 5.
2023-08-02 19:55:00 CST FinalStageConfigIsolation INFO - Store config: spark.sql.adaptive.advisoryPartitionSizeInBytes to previousStage, original value: 8MB
2023-08-02 19:55:00 CST FinalStageConfigIsolation INFO - For final stage: set spark.sql.adaptive.advisoryPartitionSizeInBytes = 384MB.
2023-08-02 19:55:00 CST FinalStageConfigIsolation INFO - Store config: spark.sql.adaptive.coalescePartitions.minPartitionNum to previousStage, original value: __INTERNAL_UNSET_CONFIG_TAG__
2023-08-02 19:55:00 CST FinalStageConfigIsolation INFO - For final stage: set spark.sql.adaptive.coalescePartitions.minPartitionNum = 1.
2023-08-02 19:55:00 CST ShufflePartitionsUtil INFO - For shuffle(2), advisory target size: 402653184, actual target size 21915537.
2023-08-02 19:55:00 CST FinalStageResourceManager INFO - The snapshot of current executors view, active executors: 100, min executor: 1, target executors: 1, has benefits: true
2023-08-02 19:55:01 CST YarnClusterSchedulerBackend INFO - Requesting to kill executor(s) 51, 19
2023-08-02 19:55:01 CST YarnClusterSchedulerBackend INFO - Actual list of executor(s) to be killed is 51, 19
2023-08-02 19:55:01 CST ApplicationMaster$AMEndpoint INFO - Driver requested to kill executor(s) 51, 19.
2023-08-02 19:55:01 CST YarnAllocator INFO - Resource profile 0 doesn't exist, adding it
2023-08-02 19:55:01 CST ExecutorAllocationManager INFO - Executors 51,19 removed due to idle timeout.
2023-08-02 19:55:02 CST FinalStageResourceManager INFO - Request to kill executors, total count 99, [88, 42, 77, 79, 2, 75, 81, 6, 15, 90, 28, 43, 63, 64, 14, 93, 70, 21, 56, 34, 10, 33, 11, 65, 61, 57, 35, 18, 3, 7, 20, 17, 32, 30, 68, 29, 86, 24, 47, 52, 38, 54, 41, 8, 9, 60, 40, 74, 4, 82, 100, 72, 45, 69, 36, 12, 46, 58, 95, 80, 44, 87, 55, 53, 5, 23, 26, 22, 97, 85, 96, 66, 59, 16, 84, 37, 48, 50, 51, 67, 39, 78, 62, 49, 71, 25, 13, 83, 89, 73, 31, 91, 19, 1, 99, 92, 94, 98, 27].
2023-08-02 19:55:02 CST YarnClusterSchedulerBackend INFO - Requesting to kill executor(s) 88, 42, 77, 79, 2, 75, 81, 6, 15, 90, 28, 43, 63, 64, 14, 93, 70, 21, 56, 34, 10, 33, 11, 65, 61, 57, 35, 18, 3, 7, 20, 17, 32, 30, 68, 29, 86, 24, 47, 52, 38, 54, 41, 8, 9, 60, 40, 74, 4, 82, 100, 72, 45, 69, 36, 12, 46, 58, 95, 80, 44, 87, 55, 53, 5, 23, 26, 22, 97, 85, 96, 66, 59, 16, 84, 37, 48, 50, 51, 67, 39, 78, 62, 49, 71, 25, 13, 83, 89, 73, 31, 91, 19, 1, 99, 92, 94, 98, 27
2023-08-02 19:55:02 CST YarnClusterSchedulerBackend INFO - Actual list of executor(s) to be killed is 77, 14, 35, 18, 7, 68, 47, 52, 8, 100, 36, 12, 80, 87, 96, 59, 16, 37, 78, 62, 31, 98
2023-08-02 19:55:02 CST YarnAllocator INFO - Resource profile 0 doesn't exist, adding it
2023-08-02 19:55:02 CST YarnAllocator INFO - Driver requested a total number of 0 executor(s) for resource profile id: 0.
2023-08-02 19:55:02 CST ApplicationMaster$AMEndpoint INFO - Driver requested to kill executor(s) 77, 14, 35, 18, 7, 68, 47, 52, 8, 100, 36, 12, 80, 87, 96, 59, 16, 37, 78, 62, 31, 98.
```

### Kyuubi Server Configurations

_No response_

### Kyuubi Engine Configurations

_No response_

### Additional context

Spark DRA was enabled and spark.dynamicAllocation.minExecutors was set to 1.

### Are you willing to submit PR?

- [X] Yes. I would be willing to submit a PR with guidance from the Kyuubi community to fix.
- [X] No. I cannot submit a PR at this time.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
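For context, the dynamic-allocation setup described under "Additional context" corresponds to a `spark-defaults.conf` fragment roughly like the sketch below. Only `spark.dynamicAllocation.enabled` and `spark.dynamicAllocation.minExecutors` come from this report; the shuffle-tracking line is an assumption added for completeness, since DRA needs either shuffle tracking or an external shuffle service to be able to release executors.

```properties
# From this report: DRA on, with a floor of one executor
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.minExecutors=1
# Assumption (not stated in the report): one common way to satisfy
# DRA's shuffle requirement without an external shuffle service
spark.dynamicAllocation.shuffleTracking.enabled=true
```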
