Juha Iso-Sipilä created SPARK-33149:
---------------------------------------
Summary: Why does ArrayType schema change between read/write for
parquet files?
Key: SPARK-33149
URL: https://issues.apache.org/jira/browse/SPARK-33149
Project: Spark
Issue Type: Question
Components: Input/Output
Affects Versions: 3.0.0
Reporter: Juha Iso-Sipilä
I have Parquet files that were produced with the org.apache.parquet Java
library (not Spark). The schema has a list of authors that is reported by
[https://github.com/apache/parquet-mr/tree/master/parquet-tools] like this:
repeated binary authors (STRING);
If I run spark.read.parquet(input_dir).write.parquet(output_dir) and inspect
the output files the same way, the authors column has changed to:
optional group authors (LIST) {
  repeated group list {
    optional binary element (STRING);
  }
}
It seems to mean the same thing, but from a schema perspective it is different.
I have another set of tools that reads the output of this step (with real
logic), but I fail to match the schemas; the original data works fine. Also,
df.printSchema() shows the same structure for both (except for a possible
nullability change, which we can ignore in this case).
Any thoughts on whether this is intentional, and whether I have any control
over it from within Spark?
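For reference, the only related knob I have found is
spark.sql.parquet.writeLegacyFormat. As far as I understand, it switches Spark
to the legacy two-level Parquet list layout rather than the one-level
"repeated binary authors" form my other tools expect, so it may not fully
solve the mismatch. A sketch of the round trip with that option (PySpark;
input_dir and output_dir are placeholders, and this requires a Spark runtime):

```python
from pyspark.sql import SparkSession

# Sketch, not a verified fix: writeLegacyFormat makes Spark emit the
# legacy 2-level list encoding instead of the standard 3-level
# "optional group ... (LIST) { repeated group list { ... } }" layout.
spark = (
    SparkSession.builder
    .config("spark.sql.parquet.writeLegacyFormat", "true")
    .getOrCreate()
)

spark.read.parquet(input_dir).write.parquet(output_dir)
```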
--
This message was sent by Atlassian Jira
(v8.3.4#803005)