[ 
https://issues.apache.org/jira/browse/SPARK-7791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14649753#comment-14649753
 ] 

Niels Becker edited comment on SPARK-7791 at 7/31/15 9:14 PM:
--------------------------------------------------------------

I ran into the same problem while saving a DataFrame as Parquet, but on a Mesos
cluster.
Our Environment:
- Ubuntu 14
- Spark 1.4.1 prebuilt for Hadoop 2.6
- GlusterFS 3.7
- Mesos 0.23.0
- Docker 1.7.1

Start _pyspark_ as _user1_ and load some data into a DataFrame {{df}}. Then run
{{df.write.format("parquet").save("/data/user1/wikipedia_test.parquet")}}
(a minimal repro sketch follows the permission listing below).
_/data_ is a GlusterFS volume on each node.
_/data/user1_ permissions:
{code}
# owner: user1
# group: user1
user::rwx
group::r-x
other::---
default:user::rwx
default:group::r-x
default:other::---
{code}
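
The repro sketch, as run in the Spark 1.4 _pyspark_ shell ({{sqlContext}} is predefined there; the sample rows are made up, any small DataFrame triggers the same failure):
{code}
# Started as user1: pyspark
# sqlContext is predefined in the pyspark shell. The rows below are made-up
# sample data; any small DataFrame reproduces the failing write.
df = sqlContext.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.write.format("parquet").save("/data/user1/wikipedia_test.parquet")
{code}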

Tomasz described a workaround in
[https://www.mail-archive.com/user@spark.apache.org/msg28820.html] which works,
but it is not applicable for us, because we need a real permission system where
users are not allowed to write into other users' folders.

I can confirm that setting {{SPARK_USER}} to either {{root}} or {{user1}} has
no effect.
Running pyspark as root works.

I assume that all Spark tasks are executed as root and override the default
file permissions but do not change the user.
So after the job is done, the driver tries to rename the files to their final
destination but fails for lack of permissions.
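
This hypothesis can be checked outside of Spark by imitating the committer's final rename step as _user1_ (the paths and task directory name below are hypothetical, only modelled on the {{_temporary}} layout Spark leaves behind):
{code}
import os

# Hypothetical paths modelled on Spark's _temporary output layout.
src = ("/data/user1/wikipedia_test.parquet/_temporary/0/"
       "task_201507311200_0001_m_000000/part-r-00001.gz.parquet")
dst = "/data/user1/wikipedia_test.parquet/part-r-00001.gz.parquet"

try:
    # rename() needs write permission on both parent directories, regardless
    # of who owns the file itself. If the executors (running as root) created
    # the task directory with root-only permissions, this fails for user1.
    os.rename(src, dst)
except OSError as e:
    print("rename failed:", e)
{code}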


was (Author: waeco):
I ran into the same problem while saving a DataFrame as Parquet, but on a Mesos
cluster.
Our Environment:
- Ubuntu 14
- Spark 1.4.1 prebuilt for Hadoop 2.6
- GlusterFS 3.7
- Mesos 0.23.0
- Docker 1.7.1

Start _pyspark_ as _sparkuser_ and load some data into a DataFrame {{df}}. Then
run {{df.write.format("parquet").save("/data/test/wikipedia_test.parquet")}}.
_/data_ is a GlusterFS volume on each node.
_/data/test_ permissions:
{code}
# owner: sparkuser
# group: sparkuser
# flags: -s-
user::rwx
group::rwx
other::r-x
default:user::rwx
default:group::rwx
default:other::r-x
{code}

Tomasz described a workaround in
[https://www.mail-archive.com/user@spark.apache.org/msg28820.html], but that
does not work for us.
The interesting thing is that {{*.gz.parquet}} files have
{noformat}root:sparkuser -rw-r--r--{noformat} permissions,
but {{*.gz.parquet.crc}} files have
{noformat}root:sparkuser -rw-rw-r--{noformat} permissions, as they should.
This suggests that Spark does not use the default file permissions, at least
for Parquet files.

I can confirm that setting {{SPARK_USER}} to either {{root}} or {{sparkuser}}
has no effect.
Running pyspark as root works.

I assume that all Spark tasks are executed as root and override the default
file permissions but do not change the user.
So after the job is done, the driver tries to rename the files to their final
destination but fails for lack of permissions.

> Set user for executors in standalone-mode
> -----------------------------------------
>
>                 Key: SPARK-7791
>                 URL: https://issues.apache.org/jira/browse/SPARK-7791
>             Project: Spark
>          Issue Type: Wish
>          Components: Spark Core
>            Reporter: Tomasz Früboes
>
> I'm opening this following a discussion in 
> https://www.mail-archive.com/user@spark.apache.org/msg28633.html
>  Our setup was as follows: Spark (1.3.1, prebuilt for Hadoop 2.6, also 2.4) 
> was installed in standalone mode and started manually from the root 
> account. Everything worked properly apart from operations such as
> rdd.saveAsPickleFile(ofile)
> which end with an exception:
> py4j.protocol.Py4JJavaError: An error occurred while calling o27.save.
> : java.io.IOException: Failed to rename 
> DeprecatedRawLocalFileStatus{path=file:/mnt/lustre/bigdata/med_home/tmp/test19EE/namesAndAges.parquet2/_temporary/0/task_201505191540_0009_r_000001/part-r-00002.parquet;
>  isDirectory=false; length=534; replication=1; blocksize=33554432; 
> modification_time=1432042832000; access_time=0; owner=; group=; 
> permission=rw-rw-rw-; isSymlink=false} to 
> file:/mnt/lustre/bigdata/med_home/tmp/test19EE/namesAndAges.parquet2/part-r-00002.parquet
>  at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:346)
> (files created in _temporary were owned by user root). It would be great if 
> Spark could set the user for the executor also in standalone mode. Setting 
> SPARK_USER has no effect here.
> BTW it may be a good idea to add a warning (e.g. during Spark startup) 
> that running from the root account is not a very healthy idea. E.g. mapping 
> this function 
> def test(x):
>     f = open('/etc/testTMF.txt', 'w')
>     return 0
> on an RDD creates a file in /etc/ (surprisingly, calls like f.write("text") 
> end with an exception)
> Thanks,
>   Tomasz



