attilapiros commented on a change in pull request #24499: [SPARK-25888][Core]
Serve local disk persisted blocks by the external service after releasing
executor by dynamic allocation
URL: https://github.com/apache/spark/pull/24499#discussion_r280070633
##########
File path:
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java
##########
@@ -179,6 +179,18 @@ public ManagedBuffer getBlockData(
return getSortBasedShuffleBlockData(executor, shuffleId, mapId, reduceId);
}
+ public ManagedBuffer getBlockData(
Review comment:
I have tried to reproduce the problem. I think you are absolutely right:
since the non-shuffle files are deleted, a recompute should be triggered.
But in my test the blocks are somehow still found.
Log excerpts follow (with 7 partitions, as I have 8 cores on this machine).
Here are the cleanups in the worker log:
```
19/05/01 15:06:11 INFO ExternalShuffleBlockResolver: Cleaning up non-shuffle
files in executor AppExecId{appId=app-20190501150515-0001, execId=5}'s 1 local
dirs
19/05/01 15:06:11 INFO Worker: Executor app-20190501150515-0001/4 finished
with state KILLED exitStatus 143
19/05/01 15:06:11 INFO ExternalShuffleBlockResolver: Clean up non-shuffle
files associated with the finished executor 4
19/05/01 15:06:11 INFO ExternalShuffleBlockResolver: Cleaning up non-shuffle
files in executor AppExecId{appId=app-20190501150515-0001, execId=4}'s 1 local
dirs
19/05/01 15:06:11 INFO Worker: Executor app-20190501150515-0001/6 finished
with state KILLED exitStatus 143
19/05/01 15:06:11 INFO ExternalShuffleBlockResolver: Clean up non-shuffle
files associated with the finished executor 6
19/05/01 15:06:11 INFO ExternalShuffleBlockResolver: Cleaning up non-shuffle
files in executor AppExecId{appId=app-20190501150515-0001, execId=6}'s 1 local
dirs
19/05/01 15:06:11 INFO Worker: Executor app-20190501150515-0001/3 finished
with state KILLED exitStatus 143
19/05/01 15:06:11 INFO ExternalShuffleBlockResolver: Clean up non-shuffle
files associated with the finished executor 3
19/05/01 15:06:11 INFO ExternalShuffleBlockResolver: Cleaning up non-shuffle
files in executor AppExecId{appId=app-20190501150515-0001, execId=3}'s 1 local
dirs
19/05/01 15:06:11 INFO Worker: Executor app-20190501150515-0001/7 finished
with state KILLED exitStatus 143
19/05/01 15:06:11 INFO ExternalShuffleBlockResolver: Clean up non-shuffle
files associated with the finished executor 7
19/05/01 15:06:11 INFO ExternalShuffleBlockResolver: Cleaning up non-shuffle
files in executor AppExecId{appId=app-20190501150515-0001, execId=7}'s 1 local
dirs
19/05/01 15:06:14 INFO Worker: Asked to kill executor
app-20190501150515-0001/0
19/05/01 15:06:14 INFO ExecutorRunner: Runner thread for executor
app-20190501150515-0001/0 interrupted
19/05/01 15:06:14 INFO ExecutorRunner: Killing process!
19/05/01 15:06:14 INFO Worker: Executor app-20190501150515-0001/0 finished
with state KILLED exitStatus 143
19/05/01 15:06:14 INFO ExternalShuffleBlockResolver: Clean up non-shuffle
files associated with the finished executor 0
19/05/01 15:06:14 INFO ExternalShuffleBlockResolver: Cleaning up non-shuffle
files in executor AppExecId{appId=app-20190501150515-0001, execId=0}'s 1 local
dirs
19/05/01 15:06:18 INFO Worker: Asked to kill executor
app-20190501150515-0001/2
19/05/01 15:06:18 INFO ExecutorRunner: Runner thread for executor
app-20190501150515-0001/2 interrupted
19/05/01 15:06:18 INFO ExecutorRunner: Killing process!
19/05/01 15:06:18 INFO Worker: Executor app-20190501150515-0001/2 finished
with state KILLED exitStatus 143
19/05/01 15:06:18 INFO ExternalShuffleBlockResolver: Clean up non-shuffle
files associated with the finished executor 2
19/05/01 15:06:18 INFO ExternalShuffleBlockResolver: Cleaning up non-shuffle
files in executor AppExecId{appId=app-20190501150515-0001, execId=2}'s 1 local
dirs
```
And this part is from the end of my test, where the persisted RDD is
referenced again. To execute these tasks a new executor is started. This is
from the stderr of that executor:
```
19/05/01 15:06:52 INFO BlockManager: Found block rdd_1_0 remotely
19/05/01 15:06:52 INFO Executor: Finished task 0.0 in stage 2.0 (TID 11).
697 bytes result sent to driver
19/05/01 15:06:52 INFO CoarseGrainedExecutorBackend: Got assigned task 12
19/05/01 15:06:52 INFO Executor: Running task 1.0 in stage 2.0 (TID 12)
19/05/01 15:06:52 INFO BlockManager: Found block rdd_1_1 remotely
19/05/01 15:06:52 INFO Executor: Finished task 1.0 in stage 2.0 (TID 12).
697 bytes result sent to driver
19/05/01 15:06:52 INFO CoarseGrainedExecutorBackend: Got assigned task 13
19/05/01 15:06:52 INFO Executor: Running task 2.0 in stage 2.0 (TID 13)
19/05/01 15:06:52 INFO BlockManager: Found block rdd_1_2 remotely
19/05/01 15:06:52 INFO Executor: Finished task 2.0 in stage 2.0 (TID 13).
697 bytes result sent to driver
19/05/01 15:06:52 INFO CoarseGrainedExecutorBackend: Got assigned task 14
19/05/01 15:06:52 INFO Executor: Running task 3.0 in stage 2.0 (TID 14)
19/05/01 15:06:52 INFO BlockManager: Found block rdd_1_3 remotely
19/05/01 15:06:52 INFO Executor: Finished task 3.0 in stage 2.0 (TID 14).
697 bytes result sent to driver
19/05/01 15:06:52 INFO CoarseGrainedExecutorBackend: Got assigned task 15
19/05/01 15:06:52 INFO Executor: Running task 4.0 in stage 2.0 (TID 15)
19/05/01 15:06:52 INFO BlockManager: Found block rdd_1_4 remotely
19/05/01 15:06:52 INFO Executor: Finished task 4.0 in stage 2.0 (TID 15).
697 bytes result sent to driver
19/05/01 15:06:52 INFO CoarseGrainedExecutorBackend: Got assigned task 16
19/05/01 15:06:52 INFO Executor: Running task 5.0 in stage 2.0 (TID 16)
19/05/01 15:06:52 INFO BlockManager: Found block rdd_1_5 remotely
19/05/01 15:06:52 INFO Executor: Finished task 5.0 in stage 2.0 (TID 16).
697 bytes result sent to driver
19/05/01 15:06:52 INFO CoarseGrainedExecutorBackend: Got assigned task 17
19/05/01 15:06:52 INFO Executor: Running task 6.0 in stage 2.0 (TID 17)
19/05/01 15:06:52 INFO BlockManager: Found block rdd_1_6 remotely
19/05/01 15:06:52 INFO Executor: Finished task 6.0 in stage 2.0 (TID 17).
697 bytes result sent to driver
19/05/01 15:06:52 INFO CoarseGrainedExecutorBackend: Got assigned task 18
19/05/01 15:06:52 INFO Executor: Running task 7.0 in stage 2.0 (TID 18)
19/05/01 15:06:52 INFO BlockManager: Found block rdd_1_7 remotely
```
I will come back to this tomorrow. I would like to understand how this works
exactly. Otherwise a possible correction (switching off the cleanup) would
duplicate these blocks.
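To cross-check the two excerpts, a small script can list which executors had
their non-shuffle files cleaned up and which RDD blocks the new executor
still found remotely. This is only a sketch for reading the logs: the helper
names (`cleaned_executors`, `remote_blocks`) are hypothetical, and the
patterns assume the exact message wording shown in the excerpts above,
including the line wrapping introduced by the mail.

```python
import re
from typing import List

# Assumed wording of the resolver's per-executor cleanup message (worker log).
CLEANUP_RE = re.compile(
    r"Cleaning up non-shuffle files in executor "
    r"AppExecId\{appId=[\w-]+, execId=(\d+)\}"
)
# Assumed wording of the BlockManager message (stderr of the new executor).
REMOTE_RE = re.compile(r"Found block (rdd_\d+_\d+) remotely")


def _unwrap(text: str) -> str:
    """Collapse all whitespace runs so messages wrapped across lines match."""
    return " ".join(text.split())


def cleaned_executors(worker_log: str) -> List[str]:
    """Executor IDs whose non-shuffle files were cleaned up, in log order."""
    return CLEANUP_RE.findall(_unwrap(worker_log))


def remote_blocks(executor_log: str) -> List[str]:
    """RDD block IDs reported as found remotely, in log order."""
    return REMOTE_RE.findall(_unwrap(executor_log))


if __name__ == "__main__":
    worker_excerpt = """
    19/05/01 15:06:11 INFO ExternalShuffleBlockResolver: Cleaning up non-shuffle
    files in executor AppExecId{appId=app-20190501150515-0001, execId=5}'s 1 local
    dirs
    """
    executor_excerpt = """
    19/05/01 15:06:52 INFO BlockManager: Found block rdd_1_0 remotely
    19/05/01 15:06:52 INFO BlockManager: Found block rdd_1_1 remotely
    """
    print("cleaned executors:", cleaned_executors(worker_excerpt))
    print("blocks found remotely:", remote_blocks(executor_excerpt))
```

Running it over the full worker log and the full executor stderr gives the
two lists side by side. Mapping each block back to the executor that
originally wrote it would need the block location information from the
driver, which these excerpts do not contain.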