attilapiros commented on pull request #28331:
URL: https://github.com/apache/spark/pull/28331#issuecomment-634530744


   There are two issues I can see.
   
   1) Locally (at my side) the `verify that an already running task which is 
going to cache data succeeds on a decommissioned executor` always fails.
   
   It fails with sbt and with maven too. Last time I tried with:
   
   ```shell
   $ ./build/mvn clean install -DskipTests
   $ ./build/mvn -pl :spark-core_2.12 test -Dtest=none 
'-Dsuites=*.BlockManagerDecommissionSuite'
   ``` 
   
   The failure is:
   
   ```
   Run starting. Expected test count is: 2
   BlockManagerDecommissionSuite:
   - verify that an already running task which is going to cache data succeeds 
on a decommissioned executor *** FAILED ***
     Vector(rdd_1_0) did not equal List() 
(BlockManagerDecommissionSuite.scala:141)
   - verify that shuffle blocks are migrated.
   Run completed in 26 seconds, 757 milliseconds.
   Total number of tests run: 2
   Suites: completed 1, aborted 0
   Tests: succeeded 1, failed 1, canceled 0, ignored 0, pending 0
   *** 1 TEST FAILED ***
   ```
   
   The relevant log from `core/target/unit-tests.log`:
   ```
   20/05/27 00:06:11.432 ScalaTest-main-running-BlockManagerDecommissionSuite 
INFO StandaloneSchedulerBackend: Propagating executor decommission to driver.
   20/05/27 00:06:11.433 dispatcher-CoarseGrainedScheduler ERROR 
CoarseGrainedSchedulerBackend$DriverEndpoint: Received decommission executor 
message 1.
   20/05/27 00:06:11.433 dispatcher-CoarseGrainedScheduler INFO 
CoarseGrainedSchedulerBackend$DriverEndpoint: Starting decommissioning executor 
1.
   20/05/27 00:06:11.434 dispatcher-CoarseGrainedScheduler INFO 
CoarseGrainedSchedulerBackend$DriverEndpoint: Finished decommissioning executor 
1.
   20/05/27 00:06:11.434 dispatcher-CoarseGrainedScheduler INFO 
CoarseGrainedSchedulerBackend$DriverEndpoint: Starting decommissioning block 
manager corresponding to executor 1.
   20/05/27 00:06:11.435 dispatcher-CoarseGrainedScheduler INFO 
CoarseGrainedSchedulerBackend$DriverEndpoint: Acknowledged decommissioning 
block manager corresponding to 1.
   20/05/27 00:06:11.922 dispatcher-BlockManagerMaster INFO BlockManagerInfo: 
Added rdd_1_0 in memory on 192.168.1.210:57044 (size: 56.0 B, free: 546.3 MiB)
   20/05/27 00:06:11.927 dispatcher-BlockManagerMaster INFO BlockManagerInfo: 
Added rdd_1_1 in memory on 192.168.1.210:57048 (size: 56.0 B, free: 546.3 MiB)
   20/05/27 00:06:11.933 task-result-getter-2 INFO TaskSetManager: Finished 
task 0.0 in stage 1.0 (TID 10) in 610 ms on 192.168.1.210 (executor 1) (1/10)
   20/05/27 00:06:11.940 dispatcher-CoarseGrainedScheduler INFO TaskSetManager: 
Starting task 2.0 in stage 1.0 (TID 12, 192.168.1.210, executor 0, partition 2, 
PROCESS_LOCAL, 7397 bytes) taskResourceAssignments Map()
   20/05/27 00:06:11.940 task-result-getter-3 INFO TaskSetManager: Finished 
task 1.0 in stage 1.0 (TID 11) in 616 ms on 192.168.1.210 (executor 0) (2/10)
   20/05/27 00:06:12.454 dispatcher-BlockManagerMaster INFO BlockManagerInfo: 
Added rdd_1_2 in memory on 192.168.1.210:57048 (size: 56.0 B, free: 546.3 MiB)
   20/05/27 00:06:12.459 dispatcher-CoarseGrainedScheduler INFO TaskSetManager: 
Starting task 3.0 in stage 1.0 (TID 13, 192.168.1.210, executor 0, partition 3, 
PROCESS_LOCAL, 7397 bytes) taskResourceAssignments Map()
   20/05/27 00:06:12.460 task-result-getter-0 INFO TaskSetManager: Finished 
task 2.0 in stage 1.0 (TID 12) in 521 ms on 192.168.1.210 (executor 0) (3/10)
   20/05/27 00:06:12.971 dispatcher-BlockManagerMaster INFO BlockManagerInfo: 
Added rdd_1_3 in memory on 192.168.1.210:57048 (size: 56.0 B, free: 546.3 MiB)
   20/05/27 00:06:12.977 dispatcher-CoarseGrainedScheduler INFO TaskSetManager: 
Starting task 4.0 in stage 1.0 (TID 14, 192.168.1.210, executor 0, partition 4, 
PROCESS_LOCAL, 7397 bytes) taskResourceAssignments Map()
   20/05/27 00:06:12.977 task-result-getter-1 INFO TaskSetManager: Finished 
task 3.0 in stage 1.0 (TID 13) in 518 ms on 192.168.1.210 (executor 0) (4/10)
   20/05/27 00:06:13.489 dispatcher-BlockManagerMaster INFO BlockManagerInfo: 
Added rdd_1_4 in memory on 192.168.1.210:57048 (size: 56.0 B, free: 546.3 MiB)
   20/05/27 00:06:13.494 dispatcher-CoarseGrainedScheduler INFO TaskSetManager: 
Starting task 5.0 in stage 1.0 (TID 15, 192.168.1.210, executor 0, partition 5, 
PROCESS_LOCAL, 7397 bytes) taskResourceAssignments Map()
   20/05/27 00:06:13.495 task-result-getter-2 INFO TaskSetManager: Finished 
task 4.0 in stage 1.0 (TID 14) in 518 ms on 192.168.1.210 (executor 0) (5/10)
   20/05/27 00:06:14.007 dispatcher-BlockManagerMaster INFO BlockManagerInfo: 
Added rdd_1_5 in memory on 192.168.1.210:57048 (size: 56.0 B, free: 546.3 MiB)
   20/05/27 00:06:14.012 dispatcher-CoarseGrainedScheduler INFO TaskSetManager: 
Starting task 6.0 in stage 1.0 (TID 16, 192.168.1.210, executor 0, partition 6, 
PROCESS_LOCAL, 7397 bytes) taskResourceAssignments Map()
   20/05/27 00:06:14.012 task-result-getter-3 INFO TaskSetManager: Finished 
task 5.0 in stage 1.0 (TID 15) in 519 ms on 192.168.1.210 (executor 0) (6/10)
   20/05/27 00:06:14.523 dispatcher-BlockManagerMaster INFO BlockManagerInfo: 
Added rdd_1_6 in memory on 192.168.1.210:57048 (size: 56.0 B, free: 546.3 MiB)
   20/05/27 00:06:14.527 dispatcher-CoarseGrainedScheduler INFO TaskSetManager: 
Starting task 7.0 in stage 1.0 (TID 17, 192.168.1.210, executor 0, partition 7, 
PROCESS_LOCAL, 7397 bytes) taskResourceAssignments Map()
   20/05/27 00:06:14.528 task-result-getter-0 INFO TaskSetManager: Finished 
task 6.0 in stage 1.0 (TID 16) in 517 ms on 192.168.1.210 (executor 0) (7/10)
   20/05/27 00:06:15.040 dispatcher-BlockManagerMaster INFO BlockManagerInfo: 
Added rdd_1_7 in memory on 192.168.1.210:57048 (size: 56.0 B, free: 546.3 MiB)
   20/05/27 00:06:15.044 dispatcher-CoarseGrainedScheduler INFO TaskSetManager: 
Starting task 8.0 in stage 1.0 (TID 18, 192.168.1.210, executor 0, partition 8, 
PROCESS_LOCAL, 7397 bytes) taskResourceAssignments Map()
   20/05/27 00:06:15.045 task-result-getter-1 INFO TaskSetManager: Finished 
task 7.0 in stage 1.0 (TID 17) in 518 ms on 192.168.1.210 (executor 0) (8/10)
   20/05/27 00:06:15.559 dispatcher-BlockManagerMaster INFO BlockManagerInfo: 
Added rdd_1_8 in memory on 192.168.1.210:57048 (size: 56.0 B, free: 546.3 MiB)
   20/05/27 00:06:15.564 dispatcher-CoarseGrainedScheduler INFO TaskSetManager: 
Starting task 9.0 in stage 1.0 (TID 19, 192.168.1.210, executor 0, partition 9, 
PROCESS_LOCAL, 7454 bytes) taskResourceAssignments Map()
   20/05/27 00:06:15.564 task-result-getter-2 INFO TaskSetManager: Finished 
task 8.0 in stage 1.0 (TID 18) in 520 ms on 192.168.1.210 (executor 0) (9/10)
   20/05/27 00:06:16.079 dispatcher-BlockManagerMaster INFO BlockManagerInfo: 
Added rdd_1_9 in memory on 192.168.1.210:57048 (size: 56.0 B, free: 546.3 MiB)
   20/05/27 00:06:16.084 task-result-getter-3 INFO TaskSetManager: Finished 
task 9.0 in stage 1.0 (TID 19) in 520 ms on 192.168.1.210 (executor 0) (10/10)
   20/05/27 00:06:16.084 task-result-getter-3 INFO TaskSchedulerImpl: Removed 
TaskSet 1.0, whose tasks have all completed, from pool 
   20/05/27 00:06:16.085 dag-scheduler-event-loop INFO DAGScheduler: 
ResultStage 1 (countAsync at BlockManagerDecommissionSuite.scala:92) finished 
in 4.771 s
   20/05/27 00:06:16.085 dag-scheduler-event-loop INFO DAGScheduler: Job 1 is 
finished. Cancelling potential speculative or zombie tasks for this job
   20/05/27 00:06:16.085 dag-scheduler-event-loop INFO TaskSchedulerImpl: 
Killing all running tasks in stage 1: Stage finished
   20/05/27 00:06:17.105 ScalaTest-main-running-BlockManagerDecommissionSuite 
INFO SparkContext: Starting job: count at 
BlockManagerDecommissionSuite.scala:134
   20/05/27 00:06:17.106 dag-scheduler-event-loop INFO DAGScheduler: Got job 2 
(count at BlockManagerDecommissionSuite.scala:134) with 10 output partitions
   20/05/27 00:06:17.106 dag-scheduler-event-loop INFO DAGScheduler: Final 
stage: ResultStage 2 (count at BlockManagerDecommissionSuite.scala:134)
   20/05/27 00:06:17.106 dag-scheduler-event-loop INFO DAGScheduler: Parents of 
final stage: List()
   20/05/27 00:06:17.107 dag-scheduler-event-loop INFO DAGScheduler: Missing 
parents: List()
   20/05/27 00:06:17.107 dag-scheduler-event-loop INFO DAGScheduler: Submitting 
ResultStage 2 (MapPartitionsRDD[1] at mapPartitions at 
BlockManagerDecommissionSuite.scala:63), which has no missing parents
   20/05/27 00:06:17.110 dag-scheduler-event-loop INFO MemoryStore: Block 
broadcast_2 stored as values in memory (estimated size 4.2 KiB, free 2.1 GiB)
   20/05/27 00:06:17.111 dag-scheduler-event-loop INFO MemoryStore: Block 
broadcast_2_piece0 stored as bytes in memory (estimated size 2.4 KiB, free 2.1 
GiB)
   20/05/27 00:06:17.112 dispatcher-BlockManagerMaster INFO BlockManagerInfo: 
Added broadcast_2_piece0 in memory on apiros-mbp16.lan:57037 (size: 2.4 KiB, 
free: 2.1 GiB)
   20/05/27 00:06:17.112 dag-scheduler-event-loop INFO SparkContext: Created 
broadcast 2 from broadcast at DAGScheduler.scala:1293
   20/05/27 00:06:17.113 dag-scheduler-event-loop INFO DAGScheduler: Submitting 
10 missing tasks from ResultStage 2 (MapPartitionsRDD[1] at mapPartitions at 
BlockManagerDecommissionSuite.scala:63) (first 15 tasks are for partitions 
Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9))
   20/05/27 00:06:17.113 dag-scheduler-event-loop INFO TaskSchedulerImpl: 
Adding task set 2.0 with 10 tasks resource profile 0
   20/05/27 00:06:17.117 dispatcher-CoarseGrainedScheduler INFO TaskSetManager: 
Starting task 1.0 in stage 2.0 (TID 20, 192.168.1.210, executor 0, partition 1, 
PROCESS_LOCAL, 7397 bytes) taskResourceAssignments Map()
   20/05/27 00:06:17.129 dispatcher-BlockManagerMaster INFO BlockManagerInfo: 
Added broadcast_2_piece0 in memory on 192.168.1.210:57048 (size: 2.4 KiB, free: 
546.3 MiB)
   20/05/27 00:06:17.135 dispatcher-CoarseGrainedScheduler INFO TaskSetManager: 
Starting task 2.0 in stage 2.0 (TID 21, 192.168.1.210, executor 0, partition 2, 
PROCESS_LOCAL, 7397 bytes) taskResourceAssignments Map()
   20/05/27 00:06:17.136 task-result-getter-0 INFO TaskSetManager: Finished 
task 1.0 in stage 2.0 (TID 20) in 20 ms on 192.168.1.210 (executor 0) (1/10)
   20/05/27 00:06:17.141 dispatcher-CoarseGrainedScheduler INFO TaskSetManager: 
Starting task 3.0 in stage 2.0 (TID 22, 192.168.1.210, executor 0, partition 3, 
PROCESS_LOCAL, 7397 bytes) taskResourceAssignments Map()
   20/05/27 00:06:17.142 task-result-getter-1 INFO TaskSetManager: Finished 
task 2.0 in stage 2.0 (TID 21) in 7 ms on 192.168.1.210 (executor 0) (2/10)
   20/05/27 00:06:17.147 dispatcher-CoarseGrainedScheduler INFO TaskSetManager: 
Starting task 4.0 in stage 2.0 (TID 23, 192.168.1.210, executor 0, partition 4, 
PROCESS_LOCAL, 7397 bytes) taskResourceAssignments Map()
   20/05/27 00:06:17.147 task-result-getter-2 INFO TaskSetManager: Finished 
task 3.0 in stage 2.0 (TID 22) in 6 ms on 192.168.1.210 (executor 0) (3/10)
   20/05/27 00:06:17.153 dispatcher-CoarseGrainedScheduler INFO TaskSetManager: 
Starting task 5.0 in stage 2.0 (TID 24, 192.168.1.210, executor 0, partition 5, 
PROCESS_LOCAL, 7397 bytes) taskResourceAssignments Map()
   20/05/27 00:06:17.153 task-result-getter-3 INFO TaskSetManager: Finished 
task 4.0 in stage 2.0 (TID 23) in 6 ms on 192.168.1.210 (executor 0) (4/10)
   20/05/27 00:06:17.159 dispatcher-CoarseGrainedScheduler INFO TaskSetManager: 
Starting task 6.0 in stage 2.0 (TID 25, 192.168.1.210, executor 0, partition 6, 
PROCESS_LOCAL, 7397 bytes) taskResourceAssignments Map()
   20/05/27 00:06:17.159 task-result-getter-0 INFO TaskSetManager: Finished 
task 5.0 in stage 2.0 (TID 24) in 6 ms on 192.168.1.210 (executor 0) (5/10)
   20/05/27 00:06:17.166 dispatcher-CoarseGrainedScheduler INFO TaskSetManager: 
Starting task 7.0 in stage 2.0 (TID 26, 192.168.1.210, executor 0, partition 7, 
PROCESS_LOCAL, 7397 bytes) taskResourceAssignments Map()
   20/05/27 00:06:17.166 task-result-getter-1 INFO TaskSetManager: Finished 
task 6.0 in stage 2.0 (TID 25) in 8 ms on 192.168.1.210 (executor 0) (6/10)
   20/05/27 00:06:17.171 dispatcher-CoarseGrainedScheduler INFO TaskSetManager: 
Starting task 8.0 in stage 2.0 (TID 27, 192.168.1.210, executor 0, partition 8, 
PROCESS_LOCAL, 7397 bytes) taskResourceAssignments Map()
   20/05/27 00:06:17.171 task-result-getter-2 INFO TaskSetManager: Finished 
task 7.0 in stage 2.0 (TID 26) in 6 ms on 192.168.1.210 (executor 0) (7/10)
   20/05/27 00:06:17.177 dispatcher-CoarseGrainedScheduler INFO TaskSetManager: 
Starting task 9.0 in stage 2.0 (TID 28, 192.168.1.210, executor 0, partition 9, 
PROCESS_LOCAL, 7454 bytes) taskResourceAssignments Map()
   20/05/27 00:06:17.177 task-result-getter-3 INFO TaskSetManager: Finished 
task 8.0 in stage 2.0 (TID 27) in 6 ms on 192.168.1.210 (executor 0) (8/10)
   20/05/27 00:06:17.183 task-result-getter-0 INFO TaskSetManager: Finished 
task 9.0 in stage 2.0 (TID 28) in 7 ms on 192.168.1.210 (executor 0) (9/10)
   20/05/27 00:06:20.661 dispatcher-CoarseGrainedScheduler INFO TaskSetManager: 
Starting task 0.0 in stage 2.0 (TID 29, 192.168.1.210, executor 0, partition 0, 
NODE_LOCAL, 7397 bytes) taskResourceAssignments Map()
   20/05/27 00:06:20.679 task-result-getter-1 INFO TaskSetManager: Finished 
task 0.0 in stage 2.0 (TID 29) in 18 ms on 192.168.1.210 (executor 0) (10/10)
   20/05/27 00:06:20.679 task-result-getter-1 INFO TaskSchedulerImpl: Removed 
TaskSet 2.0, whose tasks have all completed, from pool 
   20/05/27 00:06:20.680 dag-scheduler-event-loop INFO DAGScheduler: 
ResultStage 2 (count at BlockManagerDecommissionSuite.scala:134) finished in 
3.572 s
   20/05/27 00:06:20.680 dag-scheduler-event-loop INFO DAGScheduler: Job 2 is 
finished. Cancelling potential speculative or zombie tasks for this job
   20/05/27 00:06:20.680 dag-scheduler-event-loop INFO TaskSchedulerImpl: 
Killing all running tasks in stage 2: Stage finished
   20/05/27 00:06:20.680 ScalaTest-main-running-BlockManagerDecommissionSuite 
INFO DAGScheduler: Job 2 finished: count at 
BlockManagerDecommissionSuite.scala:134, took 3.575365 s
   20/05/27 00:06:20.699 ScalaTest-main-running-BlockManagerDecommissionSuite 
INFO BlockManagerDecommissionSuite: 
   
   ===== FINISHED o.a.s.storage.BlockManagerDecommissionSuite: 'verify that an 
already running task which is going to cache data succeeds on a decommissioned 
executor' =====
   ```
   
   As I cannot see  a "Starting block manager decommissioning process..."  
somewhere after the "Starting decommissioning block manager corresponding to 
executor 1" line I suspect it must be the asynchronous execution.
   Here 
https://github.com/apache/spark/blob/a240f9857dcf53631b269422b4652a05304691be/core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala#L320
 we return with a `Future`  which might not be completed, see:
   
https://github.com/apache/spark/blob/a240f9857dcf53631b269422b4652a05304691be/core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala#L160-L163
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to