[
https://issues.apache.org/jira/browse/SPARK-30328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
chendihao updated SPARK-30328:
------------------------------
Description:
We find that incorrect Hadoop configuration files cause saving an RDD to the
local file system to fail. This is unexpected because we specify a local
`file://` URL, and the `DataFrame.write` API does not have this issue. It is
easy to reproduce and verify with Spark 2.3.0.
1. Do not set the `HADOOP_CONF_DIR` environment variable.
2. Install pyspark and run the following Python script locally. It should work
and save the files to the local file system.
{code:java}
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local").getOrCreate()
sc = spark.sparkContext
rdd = sc.parallelize([1, 2, 3])
rdd.saveAsTextFile("file:///tmp/rdd.text")
{code}
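For reference, a quick way to confirm the local write (our addition, assuming
the usual output layout of `saveAsTextFile`: a directory of part files plus a
`_SUCCESS` marker):
{code:java}
import os

# saveAsTextFile creates a directory, not a single file; after a successful
# local write we expect something like ['_SUCCESS', 'part-00000'].
print(sorted(os.listdir("/tmp/rdd.text")))
{code}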
3. Set the `HADOOP_CONF_DIR` environment variable and put Hadoop configuration
files there. Make sure `core-site.xml` is well formed but contains an
unresolvable host name (a minimal example follows after step 4).
4. Run the same Python script again. It tries to connect to HDFS, fails to
resolve the host name, and throws a Java exception.
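For illustration, a minimal `core-site.xml` of the kind described in step 3;
the host name below is a placeholder we made up, not one from the original
report:
{code:xml}
<?xml version="1.0"?>
<configuration>
  <!-- Well-formed configuration, but the host name does not resolve. -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://unresolvable-host.example:9000</value>
  </property>
</configuration>
{code}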
We think `saveAsTextFile` with a `file:///` URL should not attempt to connect
to HDFS, regardless of whether `HADOOP_CONF_DIR` is set. In fact, the following
code works with the same incorrect Hadoop configuration files.
{code:java}
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local").getOrCreate()
rows = [("a", 1), ("b", 2)]  # sample data; `rows` was left undefined in the original snippet
df = spark.createDataFrame(rows, ["attribute", "value"])
df.write.parquet("file:///tmp/df.parquet")
{code}
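A possible workaround (our assumption, not verified in the original report) is
to force the default file system back to the local one. Spark copies properties
prefixed with `spark.hadoop.` into the Hadoop `Configuration`, which should
override values loaded from `HADOOP_CONF_DIR`:
{code:java}
from pyspark.sql import SparkSession

# Assumption: overriding fs.defaultFS so RDD writes stay on the local file
# system even with a broken core-site.xml on HADOOP_CONF_DIR.
spark = (SparkSession.builder
         .master("local")
         .config("spark.hadoop.fs.defaultFS", "file:///")
         .getOrCreate())

rdd = spark.sparkContext.parallelize([1, 2, 3])
rdd.saveAsTextFile("file:///tmp/rdd_workaround.text")
{code}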
> Fail to write local files with RDD.saveAsTextFile when incorrect Hadoop
> configuration files are set
> -----------------------------------------------------------------------------------------------------
>
> Key: SPARK-30328
> URL: https://issues.apache.org/jira/browse/SPARK-30328
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.3.0
> Reporter: chendihao
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]