[ https://issues.apache.org/jira/browse/SPARK-13912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15267352#comment-15267352 ]
Yin Huai commented on SPARK-13912:
----------------------------------
[~mcheah]
https://github.com/apache/spark/commit/5cb03220a02c70d343e82d69cfd30edb894595a1
is the fix for this issue. It has been merged and will be released in Spark
2.0.
In Spark 2.0, whenever Spark SQL needs a Hadoop conf, we create one using
sparkContext.hadoopConfiguration as the base (so it contains confs set via
spark.hadoop.*). Also, you can provide data-source-specific options through the
option function of the data source reader/writer (e.g.
df.write.option("parquet.block.size", ...).format("parquet").save(...)), and we
will propagate those options to the Hadoop conf used by the data source
reader/writer.
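For illustration, a minimal sketch of both mechanisms on Spark 2.0 (the
property values and the output path are hypothetical, not taken from the fix):

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    // Confs set via spark.hadoop.* land in sparkContext.hadoopConfiguration,
    // which Spark SQL in 2.0 uses as the base Hadoop conf.
    val conf = new SparkConf()
      .setAppName("hadoop-conf-example")
      .set("spark.hadoop.fs.s3a.connection.maximum", "100") // hypothetical value

    val spark = SparkSession.builder().config(conf).getOrCreate()
    val df = spark.range(100).toDF("id")

    // Data-source-specific options passed through option() are propagated to
    // the Hadoop conf used by the Parquet writer for this write.
    df.write
      .option("parquet.block.size", (128 * 1024 * 1024).toString) // 128 MB
      .format("parquet")
      .save("/tmp/example-output") // hypothetical path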
> spark.hadoop.* configurations are not applied for Parquet Data Frame Readers
> ----------------------------------------------------------------------------
>
> Key: SPARK-13912
> URL: https://issues.apache.org/jira/browse/SPARK-13912
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.6.1
> Reporter: Matt Cheah
>
> I populated a SparkConf object passed to a SparkContext with some
> spark.hadoop.* configurations, expecting them to be used for the underlying
> Hadoop file reads whenever I read from my DFS. However, while running some
> jobs, I noticed that the configurations were not being applied to DataFrame
> reads when I used sqlContext.read().parquet().
> I looked in the codebase and noticed that SqlNewHadoopRDD uses neither the
> SparkConf nor the SparkContext's Hadoop configuration to set up the Hadoop
> reads; instead, it uses SparkHadoopUtil.get.conf. That Hadoop configuration
> object won't have the Hadoop configurations set on the SparkContext. In
> general, there is a discrepancy in how we apply Hadoop configurations: when
> reading raw RDDs via e.g. SparkContext.textFile(), we take the Hadoop
> configuration from the SparkContext, but for DataFrames we use
> SparkHadoopUtil.get.conf.
> We should probably use the SparkContext's Hadoop configuration for DataFrames
> as well.
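> For reference, a minimal sketch of the behavior described above on 1.6.x
> (the property and paths are hypothetical):
>
>     import org.apache.spark.{SparkConf, SparkContext}
>     import org.apache.spark.sql.SQLContext
>
>     val conf = new SparkConf()
>       .setAppName("repro")
>       .set("spark.hadoop.io.file.buffer.size", "131072") // hypothetical conf
>     val sc = new SparkContext(conf)
>     val sqlContext = new SQLContext(sc)
>
>     // Raw RDD reads use sc.hadoopConfiguration, so spark.hadoop.* applies.
>     sc.textFile("/data/input.txt").count()
>
>     // DataFrame reads go through SqlNewHadoopRDD, which builds its Hadoop
>     // conf from SparkHadoopUtil.get.conf and misses the setting above.
>     sqlContext.read.parquet("/data/input.parquet").count()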