[
https://issues.apache.org/jira/browse/SPARK-35592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
VijayBhakuni updated SPARK-35592:
---------------------------------
Description:
Whenever an empty dataframe is saved as a parquet file with partitions, the
target directory contains only a _SUCCESS file.
Assuming the dataframe has 3 columns:
some_column_1, some_column_2, some_partition_column_1
and the target location for the dataframe is /user/spark/df_name:
*Current Result*: /user/spark/df_name/_SUCCESS
*Expected Result*:
/user/spark/df_name/some_partition_column_1=_HIVE_DEFAULT_PARTITION_/<some_spark_generated_file_name>.snappy.parquet
where that parquet file carries the schema of the data.
This approach ensures that any job reading this data does not fail with:
Exception: org.apache.spark.sql.AnalysisException: Unable to infer schema for
Parquet. It must be specified manually.
A similar issue was filed in the Jira ticket below, but it covered only
non-partitioned data.
https://issues.apache.org/jira/browse/SPARK-23271
We need a similar implementation for partitioned targets as well.
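For contrast, writing the same empty dataframe without partitionBy (the case
covered by SPARK-23271) leaves a parquet file that carries the schema, so
reading it back succeeds. A minimal sketch, reusing inputDF from the
reproduction steps below and a hypothetical unpartitioned target path:
{code:java}
// Non-partitioned write of the same empty dataframe: per SPARK-23271,
// this produces a schema-carrying parquet file alongside _SUCCESS.
inputDF.write
  .mode(SaveMode.Overwrite)
  .parquet("/user/spark/df_name_unpartitioned") // hypothetical path

// Reading back succeeds and yields an empty dataframe with the full schema.
val readBack = spark.read.parquet("/user/spark/df_name_unpartitioned")
{code}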
*Steps to reproduce (Scala)*:
{code:java}
import org.apache.spark.sql.SaveMode
import spark.implicits._ // needed for .toDF; already in scope in spark-shell

// create an empty DF that still carries a schema
val inputDF = Seq(
  ("value1", "value2", "partition1"),
  ("value3", "value4", "partition2"))
  .toDF("some_column_1", "some_column_2", "some_partition_column_1")
  .where("1 == 2") // always-false predicate drops every row

// write the empty dataframe into partitions
inputDF.write
  .partitionBy("some_partition_column_1")
  .mode(SaveMode.Overwrite)
  .parquet("/user/spark/df_name")

// read the dataframe back; this fails with:
// org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet.
// It must be specified manually.
val readDF = spark.read.parquet("/user/spark/df_name")
{code}
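Until this is fixed, a common workaround is to pass the schema explicitly on
read, which skips inference and simply yields an empty dataframe. A minimal
sketch, reusing the schema and path from the reproduction above:
{code:java}
// Workaround (sketch): supply the schema explicitly so Spark does not
// try to infer it from the (empty) target directory.
val readDF = spark.read
  .schema(inputDF.schema)
  .parquet("/user/spark/df_name")
{code}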
was:
Whenever an empty dataframe is saved as a parquet file with partitions, the
target directory contains only a _SUCCESS file.
Assuming the dataframe has 3 columns:
some_column_1, some_column_2, some_partition_column_1
and the target location for the dataframe is /user/spark/df_name:
*Current Result*: /user/spark/df_name/_SUCCESS
*Expected Result*:
/user/spark/df_name/some_partition_column_1=_HIVE_DEFAULT_PARTITION_/<some_spark_generated_file_name>.snappy.parquet
where that parquet file carries the schema of the data.
This approach ensures that any job reading this data does not fail with:
Exception: org.apache.spark.sql.AnalysisException: Unable to infer schema for
Parquet. It must be specified manually.
*Steps to reproduce (Scala)*:
{code:java}
import org.apache.spark.sql.SaveMode
import spark.implicits._ // needed for .toDF; already in scope in spark-shell

// create an empty DF that still carries a schema
val inputDF = Seq(
  ("value1", "value2", "partition1"),
  ("value3", "value4", "partition2"))
  .toDF("some_column_1", "some_column_2", "some_partition_column_1")
  .where("1 == 2") // always-false predicate drops every row

// write the empty dataframe into partitions
inputDF.write
  .partitionBy("some_partition_column_1")
  .mode(SaveMode.Overwrite)
  .parquet("/user/spark/df_name")

// read the dataframe back; this fails with:
// org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet.
// It must be specified manually.
val readDF = spark.read.parquet("/user/spark/df_name")
{code}
Summary: Spark creates only _SUCCESS file after empty dataFrame is
saved as parquet for partitioned data (was: Spark creates only "_SUCCESS" file
after empty dataFrame is saved as parquet for partitioned data)
> Spark creates only _SUCCESS file after empty dataFrame is saved as parquet
> for partitioned data
> -----------------------------------------------------------------------------------------------
>
> Key: SPARK-35592
> URL: https://issues.apache.org/jira/browse/SPARK-35592
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.4.0
> Reporter: VijayBhakuni
> Priority: Minor
>
> Whenever an empty dataframe is saved as a parquet file with partitions, the
> target directory contains only a _SUCCESS file.
> Assuming the dataframe has 3 columns:
> some_column_1, some_column_2, some_partition_column_1
> and the target location for the dataframe is /user/spark/df_name:
> *Current Result*: /user/spark/df_name/_SUCCESS
> *Expected Result*:
> /user/spark/df_name/some_partition_column_1=_HIVE_DEFAULT_PARTITION_/<some_spark_generated_file_name>.snappy.parquet
> where that parquet file carries the schema of the data.
> This approach ensures that any job reading this data does not fail with:
> Exception: org.apache.spark.sql.AnalysisException: Unable to infer schema
> for Parquet. It must be specified manually.
>
> A similar issue was filed in the Jira ticket below, but it covered only
> non-partitioned data.
> https://issues.apache.org/jira/browse/SPARK-23271
> We need a similar implementation for partitioned targets as well.
>
> *Steps to reproduce (Scala)*:
>
> {code:java}
> import org.apache.spark.sql.SaveMode
> import spark.implicits._ // needed for .toDF; already in scope in spark-shell
>
> // create an empty DF that still carries a schema
> val inputDF = Seq(
>   ("value1", "value2", "partition1"),
>   ("value3", "value4", "partition2"))
>   .toDF("some_column_1", "some_column_2", "some_partition_column_1")
>   .where("1 == 2") // always-false predicate drops every row
>
> // write the empty dataframe into partitions
> inputDF.write
>   .partitionBy("some_partition_column_1")
>   .mode(SaveMode.Overwrite)
>   .parquet("/user/spark/df_name")
>
> // read the dataframe back; this fails with:
> // org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet.
> // It must be specified manually.
> val readDF = spark.read.parquet("/user/spark/df_name")
> {code}
>
>