Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/7209#issuecomment-118483890
  
    @Sephiroth-Lin @scwf This issue is actually much more complicated than it 
looks. The TL;DR is that, in the early days, Parquet didn't explicitly 
specify how LIST and MAP should be constructed, so different systems and tools 
each reinvented their own wheels. The consequence is broken Parquet 
interoperability: Parquet files written by system A might not be readable 
by system B. The most recent [Parquet format spec] [1] tries to fix this by 
specifying LIST and MAP structures explicitly and adding 
backwards-compatibility rules ([1] [2], [2] [3]) to cover existing legacy data 
files.
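
    To make the LIST case concrete, here is a rough Python sketch of how a 
reader could locate the element type of a LIST-annotated group under the 
backwards-compatibility rules. This is not Spark or parquet-mr code: the 
`Field` class and the `list_element` function are hypothetical names invented 
for illustration, and the four rules are my paraphrase of the spec revision 
linked as [2].

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Field:
    """Hypothetical, simplified Parquet schema node (not a real library type)."""
    name: str
    repetition: str                      # "required", "optional" or "repeated"
    type: str = "group"                  # primitive type name, or "group"
    children: List["Field"] = field(default_factory=list)


def list_element(list_group: Field) -> Field:
    """Return the element field of a LIST-annotated group.

    Assumes the group has exactly one child and that child is repeated.
    """
    repeated = list_group.children[0]
    if repeated.repetition != "repeated":
        raise ValueError("LIST group must contain a repeated field")

    # Rule 1: a repeated primitive is itself the (required) element.
    if repeated.type != "group":
        return Field(repeated.name, "required", repeated.type)

    # Rule 2: a repeated group with several fields is itself the element.
    if len(repeated.children) > 1:
        return Field(repeated.name, "required", "group", repeated.children)

    # Rule 3: a single-field repeated group named `array` or
    # `<list name>_tuple` is itself the element (legacy writers).
    if repeated.name in ("array", list_group.name + "_tuple"):
        return Field(repeated.name, "required", "group", repeated.children)

    # Rule 4: otherwise assume the standard three-level layout, where the
    # single nested field is the element and keeps its own repetition.
    return repeated.children[0]


# Standard three-level layout: my_list (LIST) > repeated group list > element
standard = Field("my_list", "optional", "group", [
    Field("list", "repeated", "group", [
        Field("element", "optional", "binary"),
    ]),
])

# Legacy two-level layout: my_list (LIST) > repeated int32 element
legacy = Field("my_list", "optional", "group", [
    Field("element", "repeated", "int32"),
])

standard_element = list_element(standard)
legacy_element = list_element(legacy)
```

    Both layouts describe "a list of values", but a reader that only knows 
one of them would misinterpret the other, which is the interoperability 
problem described above.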
    
    We are trying to make Spark SQL compatible with the Parquet format spec. 
This work consists of three parts:
    
    1. Refactoring schema conversion between Parquet and Spark SQL (done, #6617)
    
       This makes Spark SQL recognize all "weird" LIST and MAP structures in 
legacy data files. But this only fixes schema conversion; #6617 doesn't 
refactor the actual data read path. So there's an internal feature flag, 
`spark.sql.parquet.followParquetFormatSpec`, which is turned off by default to 
stay consistent with the current data read path.
    
    2. Refactoring Parquet data read path
    
       After finishing this part, we expect to be able to read all kinds of 
legacy Parquet files, including the one mentioned in this PR.
    
    3. Refactoring Parquet data write path
    
       So that Spark SQL writes standard Parquet data that conforms to the 
Parquet format spec.
    
    I'm currently working on part 2, which fixes your problem here.  A PR will 
be sent out soon.
    
    [1]: https://github.com/apache/parquet-format
    [2]: https://github.com/apache/parquet-format/blob/5b806d1e855bf47f5234c768aefc000b704f43ab/LogicalTypes.md#backward-compatibility-rules
    [3]: https://github.com/apache/parquet-format/blob/5b806d1e855bf47f5234c768aefc000b704f43ab/LogicalTypes.md#maps


