[
https://issues.apache.org/jira/browse/SPARK-35592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
VijayBhakuni updated SPARK-35592:
---------------------------------
Description:
Whenever an empty dataframe is saved as a parquet file with partitions, the
target directory contains only a _SUCCESS file.
Assuming the dataframe has 3 columns:
some_column_1, some_column_2, some_partition_column_1
and the target location for the dataframe is /user/spark/df_name:
*Current Result*: /user/spark/df_name/_SUCCESS
*Expected Result*:
/user/spark/df_name/some_partition_column_1=_HIVE_DEFAULT_PARTITION_/<some_spark_generated_file_name>.snappy.parquet
where that parquet file carries the schema of the data.
This approach ensures that any job reading this data does not fail with:
Exception: org.apache.spark.sql.AnalysisException: Unable to infer schema for
Parquet. It must be specified manually.
A similar issue was filed in the Jira ticket below, but it covered only
non-partitioned data.
https://issues.apache.org/jira/browse/SPARK-23271
We need a similar implementation for partitioned targets as well.
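For contrast, writing the same empty dataframe without partitionBy (the case
covered by SPARK-23271) leaves a parquet file that carries the schema, so
reading it back succeeds. A minimal sketch, reusing inputDF from the
reproduction steps below and a hypothetical unpartitioned target path:
{code:java}
// Non-partitioned write of the same empty dataframe: per SPARK-23271,
// this produces a schema-carrying parquet file alongside _SUCCESS.
inputDF.write
  .mode(SaveMode.Overwrite)
  .parquet("/user/spark/df_name_unpartitioned") // hypothetical path

// Reading back succeeds and yields an empty dataframe with the full schema.
val readBack = spark.read.parquet("/user/spark/df_name_unpartitioned")
{code}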
*Steps to reproduce (Scala)*:
{code:java}
import org.apache.spark.sql.SaveMode
import spark.implicits._ // needed for .toDF; already in scope in spark-shell

// create an empty DF that still carries a schema
val inputDF = Seq(
  ("value1", "value2", "partition1"),
  ("value3", "value4", "partition2"))
  .toDF("some_column_1", "some_column_2", "some_partition_column_1")
  .where("1 == 2") // always-false predicate drops every row

// write the empty dataframe into partitions
inputDF.write
  .partitionBy("some_partition_column_1")
  .mode(SaveMode.Overwrite)
  .parquet("/user/spark/df_name")

// read the dataframe back; this fails with:
// org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet.
// It must be specified manually.
val readDF = spark.read.parquet("/user/spark/df_name")
{code}
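Until this is fixed, a common workaround is to pass the schema explicitly on
read, which skips inference and simply yields an empty dataframe. A minimal
sketch, reusing the schema and path from the reproduction above:
{code:java}
// Workaround (sketch): supply the schema explicitly so Spark does not
// try to infer it from the (empty) target directory.
val readDF = spark.read
  .schema(inputDF.schema)
  .parquet("/user/spark/df_name")
{code}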
was:
Whenever an empty dataframe is saved as a parquet file with partitions, the
target directory contains only a _SUCCESS file.
Assuming the dataframe has 3 columns:
some_column_1, some_column_2, some_partition_column_1
and the target location for the dataframe is /user/spark/df_name:
*Current Result*: /user/spark/df_name/_SUCCESS
*Expected Result*:
/user/spark/df_name/some_partition_column_1=_HIVE_DEFAULT_PARTITION_/<some_spark_generated_file_name>.snappy.parquet
where that parquet file carries the schema of the data.
This approach ensures that any job reading this data does not fail with:
Exception: org.apache.spark.sql.AnalysisException: Unable to infer schema for
Parquet. It must be specified manually.
*Steps to reproduce (Scala)*:
{code:java}
import org.apache.spark.sql.SaveMode
import spark.implicits._ // needed for .toDF; already in scope in spark-shell

// create an empty DF that still carries a schema
val inputDF = Seq(
  ("value1", "value2", "partition1"),
  ("value3", "value4", "partition2"))
  .toDF("some_column_1", "some_column_2", "some_partition_column_1")
  .where("1 == 2") // always-false predicate drops every row

// write the empty dataframe into partitions
inputDF.write
  .partitionBy("some_partition_column_1")
  .mode(SaveMode.Overwrite)
  .parquet("/user/spark/df_name")

// read the dataframe back; this fails with:
// org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet.
// It must be specified manually.
val readDF = spark.read.parquet("/user/spark/df_name")
{code}
Summary: Spark creates only _SUCCESS file after empty dataFrame is
saved as parquet for partitioned data (was: Spark creates only "_SUCCESS" file
after empty dataFrame is saved as parquet for partitioned data)
> Spark creates only _SUCCESS file after empty dataFrame is saved as parquet
> for partitioned data
> -----------------------------------------------------------------------------------------------
>
> Key: SPARK-35592
> URL: https://issues.apache.org/jira/browse/SPARK-35592
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.4.0
> Reporter: VijayBhakuni
> Priority: Minor
>
> Whenever an empty dataframe is saved as a parquet file with partitions, the
> target directory contains only a _SUCCESS file.
> Assuming the dataframe has 3 columns:
> some_column_1, some_column_2, some_partition_column_1
> and the target location for the dataframe is /user/spark/df_name:
> *Current Result*: /user/spark/df_name/_SUCCESS
> *Expected Result*:
> /user/spark/df_name/some_partition_column_1=_HIVE_DEFAULT_PARTITION_/<some_spark_generated_file_name>.snappy.parquet
> where that parquet file carries the schema of the data.
> This approach ensures that any job reading this data does not fail with:
> Exception: org.apache.spark.sql.AnalysisException: Unable to infer schema
> for Parquet. It must be specified manually.
>
> A similar issue was filed in the Jira ticket below, but it covered only
> non-partitioned data.
> https://issues.apache.org/jira/browse/SPARK-23271
> We need a similar implementation for partitioned targets as well.
>
> *Steps to reproduce (Scala)*:
>
> {code:java}
> import org.apache.spark.sql.SaveMode
> import spark.implicits._ // needed for .toDF; already in scope in spark-shell
>
> // create an empty DF that still carries a schema
> val inputDF = Seq(
>   ("value1", "value2", "partition1"),
>   ("value3", "value4", "partition2"))
>   .toDF("some_column_1", "some_column_2", "some_partition_column_1")
>   .where("1 == 2") // always-false predicate drops every row
>
> // write the empty dataframe into partitions
> inputDF.write
>   .partitionBy("some_partition_column_1")
>   .mode(SaveMode.Overwrite)
>   .parquet("/user/spark/df_name")
>
> // read the dataframe back; this fails with:
> // org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet.
> // It must be specified manually.
> val readDF = spark.read.parquet("/user/spark/df_name")
> {code}
>
>