Anitesh Minj created SPARK-56154:
------------------------------------

             Summary: [PARQUET] BINARY(ENUM) logical annotation lost when 
reading and writing back Parquet files - ENUM silently converted to STRING, 
causing schema and data corruption
                 Key: SPARK-56154
                 URL: https://issues.apache.org/jira/browse/SPARK-56154
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.3.2, 3.3.0, 2.4.0
            Reporter: Anitesh Minj


When Spark reads a Parquet file containing a column with *BINARY physical type* 
and *ENUM logical annotation*, and writes it back to a new Parquet file, the 
ENUM logical annotation is silently lost and replaced with a STRING annotation.

 

The ENUM field stores valid UTF8/ASCII bytes representing a fixed set of string 
values (e.g. "CLUBS"). Parquet viewers and downstream consumers read these as 
raw byte maps:

{"0":67,"1":76,"2":85,"3":66,"4":83}

Since Spark does not have a native enum type, it treats the ENUM-annotated 
BINARY field as a String, so the same data is represented as:
"CLUBS"

 

Steps to reproduce:

1. Inspect the source Parquet file's schema using ParquetFileReader:
{code:java}
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile
import org.apache.hadoop.fs.Path
import org.apache.hadoop.conf.Configuration

val conf = spark.sparkContext.hadoopConfiguration
val path = new Path("/tmp/enum-source.parquet")
val inputFile = HadoopInputFile.fromPath(path, conf)
val sourceReader = ParquetFileReader.open(inputFile)


println(sourceReader.getFileMetaData.getSchema)
// Output:
// message com.test.avro.MyClass {
//   required binary suit (ENUM);
// }
{code}
 

 

2. Read the attached enum-source.parquet file using Spark:
{code:java}
val sourceDF = spark.read.parquet("/tmp/enum-source.parquet") {code}
 

3. Write the DataFrame sourceDF back to a Parquet file:
{code:java}
sourceDF.write
  .option("compression", "none")
  .mode("overwrite")
  .parquet("/tmp/enum-target"){code}
 

4. Read the written Parquet file's schema using ParquetFileReader:
{code:java}
import org.apache.parquet.hadoop.ParquetFileReader 
import org.apache.parquet.hadoop.util.HadoopInputFile 
import org.apache.hadoop.fs.Path 
import org.apache.hadoop.conf.Configuration

val conf = spark.sparkContext.hadoopConfiguration
val targetPath = new Path("/tmp/enum-target/${target_file_name}.parquet")  
val targetInputFile = HadoopInputFile.fromPath(targetPath, conf)
val targetReader = ParquetFileReader.open(targetInputFile)

println(targetReader.getFileMetaData.getSchema)
// Output:
// message spark_schema {
//   optional binary suit (STRING);
// }
{code}
 

 

As the output shows, the logical annotation has changed from ENUM to STRING 
(and the field's repetition has changed from required to optional). 
This leads to data corruption as well.
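Why the annotation is lost can be sketched with a minimal, purely illustrative model (the types and functions below are hypothetical, not Spark's actual internals): on read, both ENUM- and STRING-annotated binary columns map to Catalyst's StringType, so by write time the distinction no longer exists and only STRING can be emitted.
{code:java}
// Illustrative model only - not Spark's real code paths.
sealed trait Annotation
case object ENUM extends Annotation
case object STRING extends Annotation

// Read side: Spark has no enum type, so both annotations
// collapse to the same Catalyst type.
def toCatalyst(a: Annotation): String = a match {
  case ENUM | STRING => "StringType"
}

// Write side: from StringType alone, only a STRING annotation
// can be produced - the ENUM information is already gone.
def toParquet(catalystType: String): Annotation = catalystType match {
  case "StringType" => STRING
}

println(toParquet(toCatalyst(ENUM))) // STRING
{code}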

Source data (when read using a Parquet viewer):
{"suit":{color:#FF0000}{"0":83,"1":80,"2":65,"3":68,"4":69,"5":83}{color}}

Target data (after writing the DataFrame sourceDF):
{"suit":{color:#FF0000}"SPADES"{color}}


*Expected Behavior:*
Source Schema: 
{code:java}
required binary suit (ENUM) {code}

Target Schema: 
{code:java}
required binary suit (ENUM) {code}

Source Data:
{code:java}
 {"suit":{"0":83,"1":80,"2":65,"3":68,"4":69,"5":83}} {code}

Target Data: 
{code:java}
 {"suit":{"0":83,"1":80,"2":65,"3":68,"4":69,"5":83}} {code}
 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
