Juha Iso-Sipilä created SPARK-33149:
---------------------------------------
Summary: Why does ArrayType schema change between read/write for
parquet files?
Key: SPARK-33149
URL: https://issues.apache.org/jira/browse/SPARK-33149
Project: Spark
Issue Type: Question
Components: Input/Output
Affects Versions: 3.0.0
Reporter: Juha Iso-Sipilä
I have Parquet files that were produced with the org.apache.parquet Java
library (not Spark). The schema has a list of authors that is reported by
[https://github.com/apache/parquet-mr/tree/master/parquet-tools] like this:
repeated binary authors (STRING);
If I run spark.read.parquet(input_dir).write.parquet(output_dir) and inspect
the output files the same way, the authors column has changed to:
optional group authors (LIST) {
  repeated group list {
    optional binary element (STRING);
  }
}
It seems to mean the same thing, but from a schema perspective it is different.
I have another set of tools that reads the output of this step (with real
logic), but I fail to match the schemas; the original data works fine. Also,
df.printSchema() shows the same structure for both (except for a possible
nullability change, which we can ignore in this case).
Any thoughts on whether this is intentional, and whether I have any control
over it from within Spark?
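For reference, the only related knob I have found is
spark.sql.parquet.writeLegacyFormat. As far as I understand, it switches Spark
to the legacy two-level Parquet list layout rather than the one-level
"repeated binary authors" form my other tools expect, so it may not fully
solve the mismatch. A sketch of the round trip with that option (PySpark;
input_dir and output_dir are placeholders, and this requires a Spark runtime):

```python
from pyspark.sql import SparkSession

# Sketch, not a verified fix: writeLegacyFormat makes Spark emit the
# legacy 2-level list encoding instead of the standard 3-level
# "optional group ... (LIST) { repeated group list { ... } }" layout.
spark = (
    SparkSession.builder
    .config("spark.sql.parquet.writeLegacyFormat", "true")
    .getOrCreate()
)

spark.read.parquet(input_dir).write.parquet(output_dir)
```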
--
This message was sent by Atlassian Jira
(v8.3.4#803005)