Github user felixcheung commented on a diff in the pull request:
https://github.com/apache/spark/pull/16330#discussion_r93351538
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala ---
@@ -104,6 +104,12 @@ class SparkHadoopUtil extends Logging {
}
val bufferSize = conf.get("spark.buffer.size", "65536")
hadoopConf.set("io.file.buffer.size", bufferSize)
+
+ if (conf.contains("spark.sql.default.derby.dir")) {
--- End diff ---
@yhuai
Spark uses Derby for the metastore by default. Generally, metastore_db and
derby.log get created in the current working directory. This is a problem in
more restrictive environments, such as when running as an R package, where the
guideline is not to write anything to the user's space (except under tempdir).
Just checking now, this also seems to be the case when running the pyspark
shell.
It looks like this is new behavior since 2.0.0. Would it make sense to always
default derby/metastore to tempdir, unless it is running in an application
directory that is cleaned up when the job is done (e.g. a YARN cluster)?
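For illustration, here is a minimal sketch (not the change proposed in this PR)
of how Derby's files could be redirected away from the current working
directory. derby.system.home and derby.stream.error.file are standard Derby
system properties; the temp-directory path used here is only an example:

```scala
import java.io.File
import java.nio.file.Files

// Create a scratch directory under the system temp dir (example path only).
val derbyDir: File = Files.createTempDirectory("spark-derby").toFile

// derby.system.home controls where Derby resolves relative database names,
// so metastore_db would be created under this directory instead of the cwd.
System.setProperty("derby.system.home", derbyDir.getAbsolutePath)

// derby.stream.error.file controls where derby.log is written.
System.setProperty("derby.stream.error.file",
  new File(derbyDir, "derby.log").getAbsolutePath)
```

Note these properties would need to be set before the embedded Derby metastore
is first initialized, i.e. before the first Hive-backed SparkSession is created.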
---