GitHub user steveloughran opened a pull request:
https://github.com/apache/spark/pull/19448
[SPARK-22217] [SQL] ParquetFileFormat to support arbitrary OutputCommitters
## What changes were proposed in this pull request?
`ParquetFileFormat` to relax its requirement of output committer class from
`org.apache.parquet.hadoop.ParquetOutputCommitter` or subclass thereof (and
implicitly Hadoop `FileOutputCommitter`) to any committer implementing
`org.apache.hadoop.mapreduce.OutputCommitter`.
This enables output committers which don't write to the filesystem the way
`FileOutputCommitter` does to save parquet data from a dataframe: at present
you cannot do this.
Because a committer which isn't a subclass of `ParquetOutputCommitter` will not
write Parquet summary files, `ParquetFileFormat` checks to see if the context
has requested summary metadata by setting `parquet.enable.summary-metadata`.
If true, and the committer class isn't a parquet committer, it raises a
RuntimeException with an error message.
(It could downgrade, of course, but raising an exception makes it clear
there won't be a summary. It also makes the behaviour testable.)
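The check described above can be sketched roughly as follows. This is an illustration rather than the code in the patch: the classes are local stand-ins that only mirror the real hierarchy by name so the snippet compiles without Hadoop or Parquet on the classpath, and `CommitterCheck`, `validate`, the plain `Map` used as configuration, and the error text are all hypothetical.

```scala
// Stand-in types (NOT the real Hadoop/Parquet classes); they only mirror
// the inheritance chain described in the PR.
class OutputCommitter
class FileOutputCommitter extends OutputCommitter
class ParquetOutputCommitter extends FileOutputCommitter
class MarkingFileOutputCommitter extends FileOutputCommitter

object CommitterCheck {
  // Configuration key named in the PR description.
  val SummaryKey = "parquet.enable.summary-metadata"

  /** Fails fast when summaries are requested from a non-Parquet committer. */
  def validate(committer: Class[_ <: OutputCommitter],
               conf: Map[String, String]): Unit = {
    // Assumption: an unset key is treated as "false" in this sketch;
    // the real default may differ.
    val summaryRequested = conf.get(SummaryKey).exists(_.toBoolean)
    val isParquetCommitter =
      classOf[ParquetOutputCommitter].isAssignableFrom(committer)
    if (summaryRequested && !isParquetCommitter) {
      // Hypothetical message; the patch's wording is not shown here.
      throw new RuntimeException(
        s"Committer ${committer.getName} cannot write summary metadata; " +
          s"set $SummaryKey to false")
    }
  }
}
```

Raising an exception instead of silently disabling summaries keeps the failure visible to the caller, which is the trade-off the description argues for.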
## How was this patch tested?
The patch includes a test suite, `ParquetCommitterSuite`, with a new
committer, `MarkingFileOutputCommitter` which extends `FileOutputCommitter` and
writes a marker file in the destination directory. The presence of the marker
file can be used to verify the new committer was used. The tests then try the
combinations of Parquet committer summary/no-summary and marking committer
summary/no-summary.
| committer | summary | outcome |
|-----------|---------|---------|
| parquet | true | success |
| parquet | false | success |
| marking | false | success with marker |
| marking | true | exception |
All tests are happy.
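For readers unfamiliar with the approach, a marker-writing committer along the lines described above might look like the sketch below. It is not the `MarkingFileOutputCommitter` from the patch: the class name `MarkerCommitterSketch` and the marker file name are hypothetical, and the snippet assumes the Hadoop MapReduce client libraries are on the classpath.

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.{JobContext, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter

object MarkerCommitterSketch {
  // Hypothetical marker file name, chosen for this illustration only.
  val MarkerName = "_MARKER"
}

// Behaves like FileOutputCommitter, then drops an empty marker file into
// the destination directory once the job has committed.
class MarkerCommitterSketch(outputPath: Path, context: TaskAttemptContext)
    extends FileOutputCommitter(outputPath, context) {

  override def commitJob(jobContext: JobContext): Unit = {
    super.commitJob(jobContext)
    val fs = outputPath.getFileSystem(jobContext.getConfiguration)
    // Create and immediately close the file: its existence is the signal.
    fs.create(new Path(outputPath, MarkerCommitterSketch.MarkerName)).close()
  }
}
```

A test can then assert that the marker file exists after a write, which shows that the configured committer, rather than the default one, handled the job commit.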
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/steveloughran/spark cloud/SPARK-22217-committer
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19448.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19448
----
commit e6fdbdcf4118283abd22f7b14586ed742d225657
Author: Steve Loughran <[email protected]>
Date: 2017-07-12T10:42:51Z
SPARK-22217 tuning ParquetOutputCommitter to support any committer class,
provided saveSummaries is disabled. With Tests
Change-Id: I19872dc1c095068ed5a61985d53cb7258bd9a9bb
----