Github user liancheng commented on the pull request:
https://github.com/apache/spark/pull/7209#issuecomment-118483890
@Sephiroth-Lin @scwf This issue is actually much more complicated than it
looks. The TL;DR is that, in the early days, Parquet didn't explicitly
specify how LIST and MAP should be constructed, so different systems and tools
each reinvented their own wheel. The consequence is broken Parquet
interoperability: Parquet files written by system A might not be readable
by system B. The most recent [Parquet format spec] [1] tries to fix this by
specifying LIST and MAP structures explicitly and adding
backwards-compatibility rules ([1] [2], [2] [3]) to cover existing legacy data
files.
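As a rough illustration of why readers need those rules, here is a minimal Python sketch of how a reader might decide whether the repeated field under a LIST-annotated group is the element itself (legacy 2-level layout) or a wrapper around the element (standard 3-level layout). It is a simplified reading of the backwards-compatibility rules linked above, not code from Spark or Parquet; `Field` and `repeated_field_is_element` are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Field:
    """Hypothetical, minimal stand-in for a Parquet schema node."""
    name: str
    repetition: str  # "required", "optional" or "repeated"
    children: List["Field"] = field(default_factory=list)
    is_group: bool = False


def repeated_field_is_element(repeated: Field, list_name: str) -> bool:
    """Return True if the repeated field is itself the list element
    (legacy 2-level layout), False if it only wraps the element
    (standard 3-level layout). Assumes `repeated` is the single
    repeated child of a LIST-annotated group named `list_name`."""
    if not repeated.is_group:
        return True  # repeated primitive: the field is the element
    if len(repeated.children) > 1:
        return True  # repeated group with several fields: a struct element
    if repeated.name in ("array", list_name + "_tuple"):
        return True  # single-field group using a legacy wrapper name
    return False     # otherwise treat the group as the standard wrapper


# Standard layout: repeated group `list` wrapping one `element` field.
standard = Field("list", "repeated", [Field("element", "optional")], is_group=True)
# Legacy layouts: a bare repeated primitive, and a group named `array`.
legacy_primitive = Field("element", "repeated")
legacy_array = Field("array", "repeated", [Field("value", "required")], is_group=True)
```

Under these assumptions, `repeated_field_is_element(standard, "my_list")` is false while both legacy shapes return true, which is exactly the distinction a reader has to make before it can interpret the data.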
We are trying to make Spark SQL compatible with the Parquet format spec. This
work consists of three parts:
1. Refactoring schema conversion between Parquet and Spark SQL (done, #6617)
This makes Spark SQL recognize all "weird" LIST and MAP structures in
legacy data files. But this only fixes schema conversion; #6617 doesn't
refactor the actual data read path. So there's an internal feature flag,
`spark.sql.parquet.followParquetFormatSpec`, which is turned off by default to
stay consistent with the current data read path.
2. Refactoring Parquet data read path
After finishing this part, we expect to be able to read all kinds of
legacy Parquet files, including the one mentioned in this PR.
3. Refactoring Parquet data write path
So that Spark SQL writes standard Parquet data that conforms to the Parquet
format spec.
I'm currently working on part 2, which fixes your problem here. A PR will
be sent out soon.
[1]: https://github.com/apache/parquet-format
[2]:
https://github.com/apache/parquet-format/blob/5b806d1e855bf47f5234c768aefc000b704f43ab/LogicalTypes.md#backward-compatibility-rules
[3]:
https://github.com/apache/parquet-format/blob/5b806d1e855bf47f5234c768aefc000b704f43ab/LogicalTypes.md#maps