VijayBhakuni created SPARK-35592:
------------------------------------
Summary: Spark creates only "_SUCCESS" file after empty dataFrame
is saved as parquet for partitioned data
Key: SPARK-35592
URL: https://issues.apache.org/jira/browse/SPARK-35592
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 2.4.0
Reporter: VijayBhakuni
Whenever an empty DataFrame is saved as Parquet with partitions, the
target directory contains only a _SUCCESS file.
Assume the DataFrame has 3 columns:
some_column_1, some_column_2, some_partition_column_1
and the target location for the DataFrame is /user/spark/df_name
*Current Result*: /user/spark/df_name/_SUCCESS
*Expected Result*:
/user/spark/df_name/some_partition_column_1=_HIVE_DEFAULT_PARTITION_/<some_spark_generated_file_name>.snappy.parquet
where that Parquet file carries the schema of the data.
Writing such a file ensures that any job reading this location does not
fail with:
Exception: org.apache.spark.sql.AnalysisException: Unable to infer schema for
Parquet. It must be specified manually.
*Steps for reproduce (Scala)*:
{code:java}
import org.apache.spark.sql.SaveMode
import spark.implicits._

// create an empty DF with schema
val inputDF = Seq(
    ("value1", "value2", "partition1"),
    ("value3", "value4", "partition2"))
  .toDF("some_column_1", "some_column_2", "some_partition_column_1")
  .where("1==2")

// write the dataframe into partitions
inputDF.write
  .partitionBy("some_partition_column_1")
  .mode(SaveMode.Overwrite)
  .parquet("/user/spark/df_name")

// read the dataframe back
// Throws: org.apache.spark.sql.AnalysisException: Unable to infer schema
// for Parquet. It must be specified manually.
val readDF = spark.read.parquet("/user/spark/df_name")
{code}
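As a workaround until the write side is fixed, the read can succeed if the schema is supplied explicitly, which skips Parquet schema inference entirely. A minimal sketch, assuming the reader already knows the schema (the column names and StringType fields below mirror the repro above):

{code:java}
import org.apache.spark.sql.types._

// schema assumed known by the reading job; all-String here to match the repro
val knownSchema = StructType(Seq(
  StructField("some_column_1", StringType),
  StructField("some_column_2", StringType),
  StructField("some_partition_column_1", StringType)))

// providing the schema avoids the inference step, so reading the
// empty, partition-less location no longer throws AnalysisException
val readDF = spark.read.schema(knownSchema).parquet("/user/spark/df_name")
{code}

This is only a mitigation on the reading side; the reported expectation is that the writer should emit a schema-bearing Parquet file even for an empty partitioned DataFrame.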
--
This message was sent by Atlassian Jira
(v8.3.4#803005)