Spark 1.6.0 uses /tmp in the following places (spark.local.dir is not set; yarn.nodemanager.local-dirs=/data01/yarn/nm,/data02/yarn/nm):
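For reference, the baseline described above corresponds to a yarn-site.xml entry along these lines (a sketch: only the property named in the thread, with the values given there):

```xml
<!-- yarn-site.xml: local dirs YARN hands to containers for scratch space -->
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/data01/yarn/nm,/data02/yarn/nm</value>
</property>
```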
1. spark-shell on start:

16/03/01 08:33:48 INFO storage.DiskBlockManager: Created local directory at /tmp/blockmgr-ffd3143d-b47f-4844-99fd-2d51c6a05d05

2. spark-shell on start:

16/03/01 08:33:50 INFO yarn.Client: Uploading resource file:/tmp/spark-456184c9-d59f-48f4-a9b0-560b7d310655/__spark_conf__6943938018805427428.zip -> hdfs://ip-10-101-124-30:8020/user/hadoop/.sparkStaging/application_1456776184284_0047/__spark_conf__6943938018805427428.zip

3. spark-shell / spark-sql (Hive) on start:

16/03/01 08:34:06 INFO session.SessionState: Created local directory: /tmp/01705299-a384-4e85-923b-e858017cf351_resources
16/03/01 08:34:06 INFO session.SessionState: Created HDFS directory: /tmp/hive/hadoop/01705299-a384-4e85-923b-e858017cf351
16/03/01 08:34:06 INFO session.SessionState: Created local directory: /tmp/hadoop/01705299-a384-4e85-923b-e858017cf351
16/03/01 08:34:06 INFO session.SessionState: Created HDFS directory: /tmp/hive/hadoop/01705299-a384-4e85-923b-e858017cf351/_tmp_space.db

4. The Spark executor container uses hadoop.tmp.dir (/data01/tmp/hadoop-${user.name}) for S3 output:

scala> sc.parallelize(1 to 10).saveAsTextFile("s3n://my_bucket/test/p10_13")
16/03/01 08:41:13 INFO s3native.NativeS3FileSystem: OutputStream for key 'test/p10_13/part-00000' writing to tempfile '/data01/tmp/hadoop-hadoop/s3/output-7399167152756918334.tmp'

--------------------------------------------------
If I set spark.local.dir=/data01/tmp, then #1 and #2 use /data01/tmp instead of /tmp:
--------------------------------------------------

1. 16/03/01 08:47:03 INFO storage.DiskBlockManager: Created local directory at /data01/tmp/blockmgr-db88dbd2-0ef4-433a-95ea-b33392bbfb7f

2. 16/03/01 08:47:05 INFO yarn.Client: Uploading resource file:/data01/tmp/spark-aa3e619c-a368-4f95-bd41-8448a78ae456/__spark_conf__368426817234224667.zip -> hdfs://ip-10-101-124-30:8020/user/hadoop/.sparkStaging/application_1456776184284_0050/__spark_conf__368426817234224667.zip

3. spark-sql (Hive) still uses /tmp:

16/03/01 08:47:20 INFO session.SessionState: Created local directory: /tmp/d315926f-39d7-4dcb-b3fa-60e9976f7197_resources
16/03/01 08:47:20 INFO session.SessionState: Created HDFS directory: /tmp/hive/hadoop/d315926f-39d7-4dcb-b3fa-60e9976f7197
16/03/01 08:47:20 INFO session.SessionState: Created local directory: /tmp/hadoop/d315926f-39d7-4dcb-b3fa-60e9976f7197
16/03/01 08:47:20 INFO session.SessionState: Created HDFS directory: /tmp/hive/hadoop/d315926f-39d7-4dcb-b3fa-60e9976f7197/_tmp_space.db

4. The executor still uses hadoop.tmp.dir for S3 output:

16/03/01 08:50:01 INFO s3native.NativeS3FileSystem: OutputStream for key 'test/p10_16/_SUCCESS' writing to tempfile '/data01/tmp/hadoop-hadoop/s3/output-2541604454681305094.tmp'

5. /data0X/yarn/nm is used for the usercache:

16/03/01 08:41:12 INFO storage.DiskBlockManager: Created local directory at /data01/yarn/nm/usercache/hadoop/appcache/application_1456776184284_0047/blockmgr-af5

On Mon, Feb 29, 2016 at 3:44 PM, Jeff Zhang <zjf...@gmail.com> wrote:

> In yarn mode, spark.local.dir is yarn.nodemanager.local-dirs for shuffle
> data and block manager disk data. What do you mean by "But output files to
> upload to s3 still created in /tmp on slaves"? You should have control
> over where to store your output data, if that means your job's output.
>
> On Tue, Mar 1, 2016 at 3:12 AM, Alexander Pivovarov <apivova...@gmail.com>
> wrote:
>
>> I have Spark on yarn.
>>
>> I defined yarn.nodemanager.local-dirs to be
>> /data01/yarn/nm,/data02/yarn/nm
>>
>> When I look at a yarn executor container log I see that blockmanager
>> files are created in /data01/yarn/nm,/data02/yarn/nm.
>>
>> But output files to upload to s3 are still created in /tmp on the slaves.
>>
>> I do not want Spark to write heavy files to /tmp because /tmp is only 5GB.
>>
>> The spark slaves have two big additional disks /disk01 and /disk02 attached.
>>
>> Probably I can set spark.local.dir to be /data01/tmp,/data02/tmp
>>
>> But the spark master also writes some files to spark.local.dir,
>> and my master box has only one additional disk /data01.
>>
>> So, what should I use for spark.local.dir:
>> spark.local.dir=/data01/tmp
>> or
>> spark.local.dir=/data01/tmp,/data02/tmp
>>
>> ?
>
>
> --
> Best Regards
>
> Jeff Zhang
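Pulling the thread's findings together: on YARN the executors' scratch space follows yarn.nodemanager.local-dirs (#5), spark.local.dir moves the driver-side block manager and upload staging dirs (#1, #2), while the Hive session dirs (#3) and the S3 upload buffer (#4) are controlled by Hive/Hadoop settings rather than Spark ones. A hedged sketch of the relevant overrides follows — the property names are standard Spark/Hadoop/Hive configuration keys, the paths are this cluster's, and the single-disk choice for spark.local.dir is only one conservative option given that the master has /data01 only:

```
# spark-defaults.conf (sketch; paths taken from the thread)

# Findings 1-2: driver-side block manager and staging dirs. The master box
# has only /data01, so a single dir that exists on every node is the safe
# choice; on YARN, executors ignore this and use yarn.nodemanager.local-dirs.
spark.local.dir                  /data01/tmp

# Finding 4: NativeS3FileSystem buffers uploads under ${hadoop.tmp.dir}/s3
# by default; fs.s3.buffer.dir overrides just that buffer location.
# spark.hadoop.* entries are forwarded into the Hadoop Configuration.
spark.hadoop.fs.s3.buffer.dir    /data01/tmp/s3

# Finding 3: the Hive SessionState dirs come from hive-site.xml, not the
# Spark conf:
#   hive.exec.local.scratchdir  -> local scratch (default under /tmp)
#   hive.exec.scratchdir        -> HDFS scratch  (default /tmp/hive)
```

Whether fs.s3.buffer.dir accepts a comma-separated list like the YARN dirs do is an assumption worth verifying against your Hadoop version; the single path above sidesteps that question.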