Charles Allen created SPARK-12330:
-------------------------------------

             Summary: Mesos coarse executor does not cleanup blockmgr properly 
on termination if data is stored on disk
                 Key: SPARK-12330
                 URL: https://issues.apache.org/jira/browse/SPARK-12330
             Project: Spark
          Issue Type: Bug
          Components: Block Manager, Mesos
    Affects Versions: 1.5.1
            Reporter: Charles Allen


A simple line count example can be launched as similar to 

{code}
SPARK_HOME=/mnt/tmp/spark 
MASTER=mesos://zk://zk.metamx-prod.com:2181/mesos-druid/metrics 
./bin/spark-shell --conf spark.mesos.coarse=true --conf spark.cores.max=7 
--conf spark.mesos.executor.memoryOverhead=2048 --conf 
spark.mesos.executor.home=/mnt/tmp/spark --conf 
spark.executor.extraJavaOptions='-Duser.timezone=UTC -Dfile.encoding=UTF-8 
-XX:+UseParallelGC -XX:+UseParallelOldGC -XX:ParallelGCThreads=8 
-XX:+PrintGCApplicationStoppedTime -XX:+PrintTenuringDistribution 
-XX:+PrintFlagsFinal -XX:+PrintAdaptiveSizePolicy -XX:+PrintReferenceGC 
-XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:MaxDirectMemorySize=1024m 
-verbose:gc -XX:+PrintFlagsFinal -Djava.io.tmpdir=/mnt/tmp/scratch' --conf 
spark.hadoop.fs.s3n.awsAccessKeyId='REDACTED' --conf 
spark.hadoop.fs.s3n.awsSecretAccessKey='REDACTED' --conf 
spark.executor.memory=7g --conf spark.executorEnv.GLOG_v=9 --conf 
spark.storage.memoryFraction=0.0 --conf spark.shuffle.memoryFraction=0.0
{code}

In the shell the following lines can be executed:

{code}
val text_file = 
sc.textFile("s3n://REDACTED/charlesallen/tpch/lineitem.tbl").persist(org.apache.spark.storage.StorageLevel.DISK_ONLY)
{code}
{code}
text_file.map(l => 1).sum
{code}
which will result in
{code}
res0: Double = 6001215.0
{code}
for the TPCH 1GB dataset

Unfortunately the blockmgr directory remains on the executor node after 
termination of the spark context.

The log on the executor looks like this near the termination:

{code}
I1215 02:12:31.190878 130732 process.cpp:566] Parsed message name 
'mesos.internal.ShutdownExecutorMessage' for executor(1)@172.19.67.30:58604 
from slave(1)@172.19.67.30:5051
I1215 02:12:31.190928 130732 process.cpp:2382] Spawned process 
__http__(4)@172.19.67.30:58604
I1215 02:12:31.190932 130721 process.cpp:2392] Resuming 
executor(1)@172.19.67.30:58604 at 2015-12-15 02:12:31.190924800+00:00
I1215 02:12:31.190958 130702 process.cpp:2392] Resuming 
__http__(4)@172.19.67.30:58604 at 2015-12-15 02:12:31.190951936+00:00
I1215 02:12:31.190976 130721 exec.cpp:381] Executor asked to shutdown
I1215 02:12:31.190943 130727 process.cpp:2392] Resuming 
__gc__@172.19.67.30:58604 at 2015-12-15 02:12:31.190937088+00:00
I1215 02:12:31.190991 130702 process.cpp:2497] Cleaning up 
__http__(4)@172.19.67.30:58604
I1215 02:12:31.191032 130721 process.cpp:2382] Spawned process 
(2)@172.19.67.30:58604
I1215 02:12:31.191040 130702 process.cpp:2392] Resuming (2)@172.19.67.30:58604 
at 2015-12-15 02:12:31.191037952+00:00
I1215 02:12:31.191054 130702 exec.cpp:80] Scheduling shutdown of the executor
I1215 02:12:31.191069 130721 exec.cpp:396] Executor::shutdown took 21572ns
I1215 02:12:31.191073 130702 clock.cpp:260] Created a timer for 
(2)@172.19.67.30:58604 in 5secs in the future (2015-12-15 
02:12:36.191062016+00:00)
I1215 02:12:31.191066 130720 process.cpp:2392] Resuming (1)@172.19.67.30:58604 
at 2015-12-15 02:12:31.191059200+00:00
15/12/15 02:12:31 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
I1215 02:12:31.240103 130732 clock.cpp:151] Handling timers up to 2015-12-15 
02:12:31.240091136+00:00
I1215 02:12:31.240123 130732 clock.cpp:158] Have timeout(s) at 2015-12-15 
02:12:31.240036096+00:00
I1215 02:12:31.240183 130730 process.cpp:2392] Resuming 
reaper(1)@172.19.67.30:58604 at 2015-12-15 02:12:31.240178176+00:00
I1215 02:12:31.240226 130730 clock.cpp:260] Created a timer for 
reaper(1)@172.19.67.30:58604 in 100ms in the future (2015-12-15 
02:12:31.340212992+00:00)
I1215 02:12:31.247019 130720 clock.cpp:260] Created a timer for 
(1)@172.19.67.30:58604 in 3secs in the future (2015-12-15 
02:12:34.247005952+00:00)
15/12/15 02:12:31 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: 
SIGTERM
15/12/15 02:12:31 INFO ShutdownHookManager: Shutdown hook called

....no more java logs
{code}

If the shuffle fraction is NOT set to 0.0, and the data is allowed to stay in 
memory, then the following log can be seen at termination instead:
{code}
I1215 01:19:16.247705 120052 process.cpp:566] Parsed message name 
'mesos.internal.ShutdownExecutorMessage' for executor(1)@172.19.67.24:60016 
from slave(1)@172.19.67.24:5051
I1215 01:19:16.247745 120052 process.cpp:2382] Spawned process 
__http__(4)@172.19.67.24:60016
I1215 01:19:16.247747 120034 process.cpp:2392] Resuming 
executor(1)@172.19.67.24:60016 at 2015-12-15 01:19:16.247741952+00:00
I1215 01:19:16.247758 120030 process.cpp:2392] Resuming 
__gc__@172.19.67.24:60016 at 2015-12-15 01:19:16.247755008+00:00
I1215 01:19:16.247772 120034 exec.cpp:381] Executor asked to shutdown
I1215 01:19:16.247772 120038 process.cpp:2392] Resuming 
__http__(4)@172.19.67.24:60016 at 2015-12-15 01:19:16.247767808+00:00
I1215 01:19:16.247791 120038 process.cpp:2497] Cleaning up 
__http__(4)@172.19.67.24:60016
I1215 01:19:16.247814 120034 process.cpp:2382] Spawned process 
(2)@172.19.67.24:60016
I1215 01:19:16.247823 120038 process.cpp:2392] Resuming (2)@172.19.67.24:60016 
at 2015-12-15 01:19:16.247822080+00:00
I1215 01:19:16.247836 120038 exec.cpp:80] Scheduling shutdown of the executor
I1215 01:19:16.247820 120039 process.cpp:2392] Resuming 
__gc__@172.19.67.24:60016 at 2015-12-15 01:19:16.247816960+00:00
I1215 01:19:16.247843 120034 exec.cpp:396] Executor::shutdown took 17355ns
I1215 01:19:16.247849 120038 clock.cpp:260] Created a timer for 
(2)@172.19.67.24:60016 in 5secs in the future (2015-12-15 
01:19:21.247843072+00:00)
I1215 01:19:16.247855 120049 process.cpp:2392] Resuming (1)@172.19.67.24:60016 
at 2015-12-15 01:19:16.247849984+00:00
15/12/15 01:19:16 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
15/12/15 01:19:16 INFO MemoryStore: MemoryStore cleared
15/12/15 01:19:16 INFO BlockManager: BlockManager stopped
15/12/15 01:19:16 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down 
remote daemon.
15/12/15 01:19:16 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon 
shut down; proceeding with flushing remote transports.
15/12/15 01:19:16 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut 
down.
15/12/15 01:19:16 INFO ShutdownHookManager: Shutdown hook called

.... no more java logs
{code}

Notice in the former case with dirty directories left over, there is a SIGTERM 
intercepted by the executor, and there are no log messages indicating disk 
cleaning occurred. In the later case, the shutdown hooks for memory and block 
managers are seen.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to