Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/7238#issuecomment-126017537
  
    Hey @viirya, sorry that I lied, this might not be the FINAL check yet...
    
    Actually I've begun to regret one of my previous decisions, namely merging
    part-files which don't have corresponding summary files.  This is mostly
    because there are too many cases to consider once we assume summary files
    may be missing, which makes the behavior of this configuration pretty
    unintuitive: sometimes no part-files are merged, while sometimes some
    part-files get merged.  Parquet summary files can be missing under various
    corner cases, so the behavior is hard to explain and may confuse Spark
    users.  The key problem here is that Parquet summary files are not
    written/accessed in an atomic manner, and that's one of the most important
    reasons why the Parquet team is actually trying to get rid of the summary
    files entirely.
    
    Since the configuration is named "respectSummaryFiles", it seems more
    natural and intuitive to assume that summary files are ALWAYS properly
    generated for ALL Parquet write jobs when this configuration is turned on.
    To be more specific, given one or more Parquet input paths, we may find one
    or more summary files, and the metadata gathered by merging all these
    summary files should reflect the real schema of the given Parquet dataset.
    Only in this case can we really "respect" existing summary files.
    
    So my suggestion here is that, when the "respectSummaryFiles" configuration
    is turned on, we only collect all summary files, merge the schemas read
    from them, and use the merged schema as the final result schema.  And of
    course, this configuration should still be turned off by default.  We can
    document this configuration with an "expert only" tag.
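    For illustration, here's a rough sketch of what I have in mind for the
    schema discovery part.  The method name `schemaFromSummaryFiles` and the
    `leaves` parameter are made up for this sketch, and I'm assuming the
    parquet-mr 1.7 style `ParquetFileReader` / `ParquetFileWriter` APIs:

    ```scala
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.FileStatus
    import org.apache.parquet.hadoop.{ParquetFileReader, ParquetFileWriter}
    import org.apache.parquet.schema.MessageType

    // Sketch only: with "respectSummaryFiles" on, schema discovery consults
    // summary files exclusively and never falls back to part-files.
    def schemaFromSummaryFiles(
        conf: Configuration, leaves: Seq[FileStatus]): Option[MessageType] = {
      // Keep only _metadata / _common_metadata among the leaf files of the
      // input paths.
      val summaries = leaves.filter { f =>
        val name = f.getPath.getName
        name == ParquetFileWriter.PARQUET_METADATA_FILE ||
          name == ParquetFileWriter.PARQUET_COMMON_METADATA_FILE
      }

      // Read the footer of each summary file and take its Parquet schema.
      val schemas = summaries.map { summary =>
        ParquetFileReader.readFooter(conf, summary.getPath)
          .getFileMetaData
          .getSchema
      }

      // Merge all schemas read from the summary files.  If no summary file is
      // found, schema discovery simply fails instead of merging part-files.
      schemas.reduceOption(_ union _)
    }
    ```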
    
    I still consider this configuration quite useful, because even if you have
    a dirty Parquet dataset at hand, either without summary files or with
    incorrect ones, you can still repair the summary files quite easily.
    Essentially you only need to call `ParquetOutputFormat.writeMetaDataFile`,
    which either generates correct summary files for the entire dataset or
    deletes the broken summary files if it fails to merge all user-defined
    key-value metadata.
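    Just to make the repair story concrete, such a repair pass could look
    roughly like the sketch below.  Here I'm using the lower-level
    `ParquetFileWriter.writeMetadataFile` variant instead, again assuming
    parquet-mr 1.7 style APIs, with `root` pointing to the dataset directory
    and error handling left out:

    ```scala
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.parquet.hadoop.{ParquetFileReader, ParquetFileWriter}

    // Sketch only: rebuild the summary files of an existing dataset from the
    // footers of its part-files.
    def repairSummaryFiles(conf: Configuration, root: Path): Unit = {
      val fs = root.getFileSystem(conf)

      // Collect footers of all part-files under the dataset root (hidden
      // files, e.g. old "_metadata", are skipped by parquet-mr).
      val footers =
        ParquetFileReader.readAllFootersInParallel(conf, fs.getFileStatus(root))

      // Write fresh _metadata / _common_metadata files from the merged
      // footers.  This fails if the user-defined key-value metadata of the
      // part-files conflicts.
      ParquetFileWriter.writeMetadataFile(conf, root, footers)
    }
    ```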
    
    What do you think?  Again, sorry for the late review and the extra effort
    of implementing all those intermediate versions...