Github user tgravescs commented on the pull request:
https://github.com/apache/spark/pull/204#issuecomment-39242381
Thanks @pwendell. I'm going to try to look at this in more detail in the
next day or so.
The MapReduce history server would be one thing to compare to. It has one
directory (done_intermediate) with the sticky bit set, where users write their
history files with the permissions specified by the user (generally restrictive).
The history server runs as a superuser and copies the history files from
done_intermediate to a done directory that is more restrictive, so the world
can't read or write to it. The history server serves up the files and restricts
access based on ACLs.
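To make the two-directory layout concrete, here is a minimal sketch on a local
filesystem. The modes 1777 and 0770, the helper names, and the use of plain
POSIX calls instead of the Hadoop FileSystem API are all assumptions made for
illustration; this is not what MapReduce or Spark actually runs.

```python
import os
import stat
import tempfile


def create_history_dirs(root):
    """Create done_intermediate and done under root (hypothetical helper)."""
    intermediate = os.path.join(root, "done_intermediate")
    done = os.path.join(root, "done")
    os.makedirs(intermediate, exist_ok=True)
    os.makedirs(done, exist_ok=True)
    # chmod explicitly, because the mode passed to makedirs is filtered by umask.
    # Assumed mode 1777: anyone may write, but the sticky bit stops users from
    # deleting or renaming files they do not own.
    os.chmod(intermediate, 0o1777)
    # Assumed mode 0770: only the history server user and its group get access.
    os.chmod(done, 0o770)
    return intermediate, done


def mode_of(path):
    """Return the permission bits of path, including the sticky bit."""
    return stat.S_IMODE(os.stat(path).st_mode)


if __name__ == "__main__":
    base = tempfile.mkdtemp()
    for path in create_history_dirs(base):
        print(path, oct(mode_of(path)))
```

On HDFS the same idea would go through the filesystem's own permission calls,
and an admin could just as well create both directories by hand.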
The important thing is that we make it so it can be secured and document
how users do that. If that means manually creating some directories and setting
permissions, I think that is fine for now. If Spark is creating the directories,
we need to make sure it does the right thing or has configs so that admins can
have it set the permissions appropriately.
Is there any infrastructure in place to manage/delete the log files? If
you are running thousands of applications a day the logs can add up pretty
quickly.
Can we add docs about the history server?
This is probably a separate JIRA, but it would be nice to clarify the
documentation of the spark.eventLog.dir config to indicate whether it can
point to HDFS or other filesystems.