[ https://issues.apache.org/jira/browse/SPARK-4796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14239389#comment-14239389 ]

Sean Owen commented on SPARK-4796:
----------------------------------

Do you mean "job" in the sense that term is used in Spark?
http://spark.apache.org/docs/latest/cluster-overview.html
A job computes partitions of an RDD; if those partitions are lost or needed
again, the job runs again. That is just how Spark works. I assume you mean the
executors are still running. I don't know whether it is documented that shuffle
files are retained while they are, but I'm fairly sure that's correct. It's not
clear what the temp files are or whether they're still in use; obviously, if
the executor needs them they can't go away, but perhaps they are something that
could be cleaned up earlier. If you're running with 10GB of disk and easily
overrunning it, I suspect temp files are not the ultimate issue here.
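
As a side note on the disk pressure: Spark's scratch location is controlled by
spark.local.dir, so one workaround is to point it at a volume with more
headroom than the container's 10GB. A minimal sketch, with the path and app
name as placeholders (note that on Mesos a SPARK_LOCAL_DIRS environment
variable set for the executors takes precedence over this setting):

    import org.apache.spark.{SparkConf, SparkContext}

    // Send shuffle and temp files to a larger mount instead of the default /tmp.
    val conf = new SparkConf()
      .setAppName("example")                      // placeholder app name
      .set("spark.local.dir", "/data/spark-tmp")  // placeholder scratch path
    val sc = new SparkContext(conf)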

> Spark does not remove temp files
> --------------------------------
>
>                 Key: SPARK-4796
>                 URL: https://issues.apache.org/jira/browse/SPARK-4796
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 1.1.0
>         Environment: I'm running Spark on Mesos, and the Mesos slaves are
> Docker containers. Spark 1.1.0, elasticsearch-spark 2.1.0-Beta3, Mesos 0.20.0,
> Docker 1.2.0.
>            Reporter: Ian Babrou
>
> I started a job whose data does not fit in memory and got "no space left on
> device". That was fair, because the Docker containers only have 10GB of disk
> space and some of it is already taken by the OS.
> But then I found that when the job failed, it didn't release any disk space,
> leaving the container without any free space.
> Then I decided to check whether Spark removes temp files at all, because many
> Mesos slaves had /tmp/spark-local-* directories. Apparently some garbage stays
> after a Spark task is finished. I attached strace to the running job:
> [pid 30212] unlink("/tmp/spark-local-20141209091330-48b5/12/temp_8a73fcc2-4baa-499a-8add-0161f918de8a") = 0
> [pid 30212] unlink("/tmp/spark-local-20141209091330-48b5/31/temp_47efd04b-d427-4139-8f48-3d5d421e9be4") = 0
> [pid 30212] unlink("/tmp/spark-local-20141209091330-48b5/15/temp_619a46dc-40de-43f1-a844-4db146a607c6") = 0
> [pid 30212] unlink("/tmp/spark-local-20141209091330-48b5/05/temp_d97d90a7-8bc1-4742-ba9b-41d74ea73c36" <unfinished ...>
> [pid 30212] <... unlink resumed> )      = 0
> [pid 30212] unlink("/tmp/spark-local-20141209091330-48b5/36/temp_a2deb806-714a-457a-90c8-5d9f3247a5d7") = 0
> [pid 30212] unlink("/tmp/spark-local-20141209091330-48b5/04/temp_afd558f1-2fd0-48d7-bc65-07b5f4455b22") = 0
> [pid 30212] unlink("/tmp/spark-local-20141209091330-48b5/32/temp_a7add910-8dc3-482c-baf5-09d5a187c62a" <unfinished ...>
> [pid 30212] <... unlink resumed> )      = 0
> [pid 30212] unlink("/tmp/spark-local-20141209091330-48b5/21/temp_485612f0-527f-47b0-bb8b-6016f3b9ec19") = 0
> [pid 30212] unlink("/tmp/spark-local-20141209091330-48b5/12/temp_bb2b4e06-a9dd-408e-8395-f6c5f4e2d52f") = 0
> [pid 30212] unlink("/tmp/spark-local-20141209091330-48b5/1e/temp_825293c6-9d3b-4451-9cb8-91e2abe5a19d" <unfinished ...>
> [pid 30212] <... unlink resumed> )      = 0
> [pid 30212] unlink("/tmp/spark-local-20141209091330-48b5/15/temp_43fbb94c-9163-4aa7-ab83-e7693b9f21fc") = 0
> [pid 30212] unlink("/tmp/spark-local-20141209091330-48b5/3d/temp_37f3629c-1b09-4907-b599-61b7df94b898" <unfinished ...>
> [pid 30212] <... unlink resumed> )      = 0
> [pid 30212] unlink("/tmp/spark-local-20141209091330-48b5/35/temp_d18f49f6-1fb1-4c01-a694-0ee0a72294c0") = 0
> And after the job is finished, some files are still there:
> /tmp/spark-local-20141209091330-48b5/
> /tmp/spark-local-20141209091330-48b5/11
> /tmp/spark-local-20141209091330-48b5/11/shuffle_0_1_4
> /tmp/spark-local-20141209091330-48b5/32
> /tmp/spark-local-20141209091330-48b5/04
> /tmp/spark-local-20141209091330-48b5/05
> /tmp/spark-local-20141209091330-48b5/0f
> /tmp/spark-local-20141209091330-48b5/0f/shuffle_0_1_2
> /tmp/spark-local-20141209091330-48b5/3d
> /tmp/spark-local-20141209091330-48b5/0e
> /tmp/spark-local-20141209091330-48b5/0e/shuffle_0_1_1
> /tmp/spark-local-20141209091330-48b5/15
> /tmp/spark-local-20141209091330-48b5/0d
> /tmp/spark-local-20141209091330-48b5/0d/shuffle_0_1_0
> /tmp/spark-local-20141209091330-48b5/36
> /tmp/spark-local-20141209091330-48b5/31
> /tmp/spark-local-20141209091330-48b5/12
> /tmp/spark-local-20141209091330-48b5/21
> /tmp/spark-local-20141209091330-48b5/10
> /tmp/spark-local-20141209091330-48b5/10/shuffle_0_1_3
> /tmp/spark-local-20141209091330-48b5/1e
> /tmp/spark-local-20141209091330-48b5/35
> If I look at my Mesos slaves, these are mostly "shuffle" files; the overall
> picture for a single node:
> root@web338:~# find /tmp/spark-local-20141* -type f | fgrep shuffle | wc -l
> 781
> root@web338:~# find /tmp/spark-local-20141* -type f | fgrep -v shuffle | wc -l
> 10
> root@web338:~# find /tmp/spark-local-20141* -type f | fgrep -v shuffle
> /tmp/spark-local-20141119144512-67c4/2d/temp_9056f380-3edb-48d6-a7df-d4896f1e1cc3
> /tmp/spark-local-20141119144512-67c4/3d/temp_e005659b-eddf-4a34-947f-4f63fcddf111
> /tmp/spark-local-20141119144512-67c4/16/temp_71eba702-36b4-4e1a-aebc-20d2080f1705
> /tmp/spark-local-20141119144512-67c4/0d/temp_8037b9db-2d8a-4786-a554-a8cad922bf5e
> /tmp/spark-local-20141119144512-67c4/24/temp_f0e4cc43-6cc9-42a7-882d-f8a031fa4dc3
> /tmp/spark-local-20141119144512-67c4/29/temp_a8bbe2cb-f590-4b71-8ef8-9c0324beddc7
> /tmp/spark-local-20141119144512-67c4/3a/temp_9fc08519-f23a-40ac-a3fd-e58df6871460
> /tmp/spark-local-20141119144512-67c4/1e/temp_d66668ab-2999-48af-a136-84cfd6f5f6cb
> /tmp/spark-local-20141205110922-f78e/0a/temp_7409add5-e6ff-46e5-ae3f-6a4c7b2ddf8f
> /tmp/spark-local-20141205111026-0b53/01/temp_72024c94-7512-4692-8bd1-ef2417143d8c
> Conclusions:
> 1. Shuffle files should be removed, but they stay.
> 2. Temp files should always be removed, but they stay.
> Maybe we should unlink temp and shuffle files immediately after creation, so
> that they are removed even if Spark fails.
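> For illustration, a minimal sketch of that unlink-after-open idea in plain JVM
> code (assuming a POSIX filesystem; this is not Spark's actual code path): the
> name disappears immediately, the data stays reachable through the open
> descriptor, and the kernel reclaims the space as soon as the process exits,
> even after a crash.
>     import java.io.{File, FileOutputStream}
>
>     // Create a scratch file, open it, then drop its directory entry right away.
>     val scratch = File.createTempFile("temp_", null, new File("/tmp"))
>     val out = new FileOutputStream(scratch)
>     scratch.delete()                  // the inode survives while `out` stays open
>     out.write(new Array[Byte](1024))  // writes still go through the descriptor
>     out.close()                       // space is reclaimed here, or when the JVM dies
> This would only work for files that are written and read through a single open
> handle; shuffle files have to be re-opened later to serve fetches from other
> executors, which is presumably why they stay on disk.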



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
