Github user bbossy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11272#discussion_r53795749
  
    --- Diff: network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/BlockTransferMessage.java ---
    @@ -40,7 +41,8 @@
     
       /** Preceding every serialized message is its type, which allows us to deserialize it. */
       public static enum Type {
    -    OPEN_BLOCKS(0), UPLOAD_BLOCK(1), REGISTER_EXECUTOR(2), STREAM_HANDLE(3), REGISTER_DRIVER(4);
    +    OPEN_BLOCKS(0), UPLOAD_BLOCK(1), REGISTER_EXECUTOR(2), STREAM_HANDLE(3), REGISTER_DRIVER(4),
    +    HEARTBEAT(5);
    --- End diff --
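
For readers following the diff, here is a minimal, self-contained sketch of the type-tag idea described in the Javadoc comment: one byte identifying the message type precedes the payload, so the receiver can pick the right decoder. The class `TypeTagSketch` and the helpers `encode`, `readType`, and `fromId` are hypothetical names for illustration and are not Spark's actual implementation; only the enum constants and their ids come from the diff.

```java
import java.nio.ByteBuffer;

public class TypeTagSketch {
    /** Message types; the constants and ids mirror the enum in the diff. */
    enum Type {
        OPEN_BLOCKS(0), UPLOAD_BLOCK(1), REGISTER_EXECUTOR(2),
        STREAM_HANDLE(3), REGISTER_DRIVER(4), HEARTBEAT(5);

        private final byte id;

        Type(int id) {
            // Assumption of this sketch: the tag must fit in one signed byte.
            if (id < 0 || id > Byte.MAX_VALUE) {
                throw new IllegalArgumentException("Type id out of range: " + id);
            }
            this.id = (byte) id;
        }

        byte id() {
            return id;
        }

        /** Hypothetical helper: maps a wire id back to its enum constant. */
        static Type fromId(byte id) {
            for (Type t : values()) {
                if (t.id == id) {
                    return t;
                }
            }
            throw new IllegalArgumentException("Unknown message type: " + id);
        }
    }

    /** Writes the one-byte type tag, then the payload. */
    static ByteBuffer encode(Type type, byte[] payload) {
        ByteBuffer buf = ByteBuffer.allocate(1 + payload.length);
        buf.put(type.id()).put(payload);
        buf.flip(); // switch the buffer from writing to reading
        return buf;
    }

    /** Reads the leading tag; the buffer is left positioned at the payload. */
    static Type readType(ByteBuffer msg) {
        return Type.fromId(msg.get());
    }

    public static void main(String[] args) {
        ByteBuffer msg = encode(Type.HEARTBEAT, new byte[0]);
        System.out.println(readType(msg)); // prints HEARTBEAT
    }
}
```

In this sketch, a receiver whose enum stops at `REGISTER_DRIVER(4)` would throw on tag `5`, which is one reason to append new types with fresh ids rather than renumbering existing ones.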
    
    I haven't done extensive testing, but I started a standalone master and some slaves on a test cluster. I then started a spark-shell, parallelized some Ints, cached them with ```persist(StorageLevel.DISK_ONLY)```, and ran a ```sum``` on the result. I let the shell idle for a while and later performed another transformation on the cached RDD. That worked fine.
    
    I had this in the log of one of the workers:
    ```
    16/02/23 13:21:58 INFO deploy.ExternalShuffleService: Starting shuffle service on port 7338 with useSasl = false
    ...
    16/02/23 13:21:59 INFO shuffle.ExternalShuffleBlockResolver: Registered executor AppExecId{appId=app-20160223131906-0002, execId=0} with ExecutorShuffleInfo{localDirs=[/tmp/spark-117baa2b-aa4f-4c81-9724-f41d96182357/executor-0684a8d3-0996-4d96-8031-b41abd958df4/blockmgr-aa778ad3-43e7-494e-b527-3b1eaf41fdb5], subDirsPerLocalDir=64, shuffleManager=sort}
    16/02/23 13:37:40 INFO worker.Worker: Asked to kill executor app-20160223131906-0002/0
    16/02/23 13:37:40 INFO worker.ExecutorRunner: Runner thread for executor app-20160223131906-0002/0 interrupted
    16/02/23 13:37:40 INFO worker.ExecutorRunner: Killing process!
    16/02/23 13:37:40 INFO worker.Worker: Executor app-20160223131906-0002/0 finished with state KILLED exitStatus 137
    16/02/23 13:37:40 INFO worker.Worker: Cleaning up local directories for application app-20160223131906-0002
    16/02/23 13:37:40 INFO shuffle.ExternalShuffleBlockResolver: Application app-20160223131906-0002 removed, cleanupLocalDirs = true
    16/02/23 13:37:40 INFO shuffle.ExternalShuffleBlockResolver: Cleaning up executor AppExecId{appId=app-20160223131906-0002, execId=0}'s 1 local dirs
    ```
    
    Aside: Is ```sbin/start-shuffle-service.sh``` still used anywhere? It seems that in standalone mode the shuffle service is started by the worker.
    
    What do you have in mind for a serious test?

