Github user bbossy commented on a diff in the pull request:
https://github.com/apache/spark/pull/11272#discussion_r53795749
--- Diff:
network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/BlockTransferMessage.java
---
@@ -40,7 +41,8 @@
   /** Preceding every serialized message is its type, which allows us to deserialize it. */
   public static enum Type {
-    OPEN_BLOCKS(0), UPLOAD_BLOCK(1), REGISTER_EXECUTOR(2), STREAM_HANDLE(3), REGISTER_DRIVER(4);
+    OPEN_BLOCKS(0), UPLOAD_BLOCK(1), REGISTER_EXECUTOR(2), STREAM_HANDLE(3), REGISTER_DRIVER(4),
+    HEARTBEAT(5);
--- End diff --
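For readers following the diff, here is a minimal sketch of the idea in the Javadoc comment: a one-byte type tag written ahead of the payload, which the receiver reads first to decide how to decode the rest. This is not Spark's actual `BlockTransferMessage` code. The class `TypeTagSketch` and the helpers `fromId`, `encode`, and `decodeType` are hypothetical names used only for illustration; only the enum constants and their ids are taken from the diff.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch, NOT the real BlockTransferMessage implementation.
final class TypeTagSketch {

  /** Message types with the ids shown in the diff, including the new HEARTBEAT(5). */
  enum Type {
    OPEN_BLOCKS(0), UPLOAD_BLOCK(1), REGISTER_EXECUTOR(2),
    STREAM_HANDLE(3), REGISTER_DRIVER(4), HEARTBEAT(5);

    private final byte id;

    Type(int id) {
      // The tag is written as a single signed byte, so keep ids in 0..127.
      if (id < 0 || id > 127) {
        throw new IllegalArgumentException("Type id does not fit in one byte: " + id);
      }
      this.id = (byte) id;
    }

    byte id() {
      return id;
    }

    /** Maps a tag byte back to its Type; unknown tags are rejected. */
    static Type fromId(byte id) {
      for (Type t : values()) {
        if (t.id == id) {
          return t;
        }
      }
      throw new IllegalArgumentException("Unknown message type: " + id);
    }
  }

  /** Writes the type tag first, then the (assumed already serialized) payload. */
  static ByteBuffer encode(Type type, byte[] payload) {
    ByteBuffer buf = ByteBuffer.allocate(1 + payload.length);
    buf.put(type.id());
    buf.put(payload);
    buf.flip();
    return buf;
  }

  /** Reads the leading tag byte; the buffer is then positioned at the payload. */
  static Type decodeType(ByteBuffer buf) {
    return Type.fromId(buf.get());
  }

  public static void main(String[] args) {
    ByteBuffer wire = encode(Type.HEARTBEAT, new byte[0]);
    System.out.println(decodeType(wire)); // prints HEARTBEAT
  }
}
```

In this sketch, a receiver whose enum stops at `REGISTER_DRIVER(4)` would throw on tag 5. Whether the real protocol behaves the same way for peers built before this change is something to check against the actual decoder.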
I haven't done extensive testing, but I started a standalone master and
some slaves on a test cluster. I then started a spark-shell, parallelized some
Ints, cached them using ```persist(StorageLevel.DISK_ONLY)```, and ran a
```sum``` on the result. After letting it idle for a bit, I performed another
transformation on the cached RDD. That worked fine.

This is what I had in the log of one of the workers:
```
16/02/23 13:21:58 INFO deploy.ExternalShuffleService: Starting shuffle service on port 7338 with useSasl = false
...
16/02/23 13:21:59 INFO shuffle.ExternalShuffleBlockResolver: Registered executor AppExecId{appId=app-20160223131906-0002, execId=0} with ExecutorShuffleInfo{localDirs=[/tmp/spark-117baa2b-aa4f-4c81-9724-f41d96182357/executor-0684a8d3-0996-4d96-8031-b41abd958df4/blockmgr-aa778ad3-43e7-494e-b527-3b1eaf41fdb5], subDirsPerLocalDir=64, shuffleManager=sort}
16/02/23 13:37:40 INFO worker.Worker: Asked to kill executor app-20160223131906-0002/0
16/02/23 13:37:40 INFO worker.ExecutorRunner: Runner thread for executor app-20160223131906-0002/0 interrupted
16/02/23 13:37:40 INFO worker.ExecutorRunner: Killing process!
16/02/23 13:37:40 INFO worker.Worker: Executor app-20160223131906-0002/0 finished with state KILLED exitStatus 137
16/02/23 13:37:40 INFO worker.Worker: Cleaning up local directories for application app-20160223131906-0002
16/02/23 13:37:40 INFO shuffle.ExternalShuffleBlockResolver: Application app-20160223131906-0002 removed, cleanupLocalDirs = true
16/02/23 13:37:40 INFO shuffle.ExternalShuffleBlockResolver: Cleaning up executor AppExecId{appId=app-20160223131906-0002, execId=0}'s 1 local dirs
```
Aside: Is ```sbin/start-shuffle-service.sh``` still used anywhere? It
seems that in standalone mode the shuffle service gets started by the worker.

What do you have in mind for a serious test?