[GitHub] spark pull request #19448: [SPARK-22217] [SQL] ParquetFileFormat to support ...

steveloughran Fri, 06 Oct 2017 09:15:49 -0700

GitHub user steveloughran opened a pull request:

    https://github.com/apache/spark/pull/19448


    [SPARK-22217] [SQL] ParquetFileFormat to support arbitrary OutputCommitters

    ## What changes were proposed in this pull request?
    
    `ParquetFileFormat` relaxes its requirement on the output committer class from 
`org.apache.parquet.hadoop.ParquetOutputCommitter` or a subclass thereof (and, 
implicitly, Hadoop `FileOutputCommitter`) to any committer implementing 
`org.apache.hadoop.mapreduce.OutputCommitter`.
    
    This lets output committers which don't write to the filesystem the way 
`FileOutputCommitter` does be used to save Parquet data from a DataFrame; at 
present this is not possible.
    
    Because a committer which isn't a subclass of `ParquetOutputCommitter` 
cannot write the summary metadata, the code checks whether the context has 
requested it by setting `parquet.enable.summary-metadata`. If that is true and 
the committer class isn't a Parquet committer, it raises a `RuntimeException` 
with an error message.
    
    (It could downgrade, of course, but raising an exception makes it clear 
there won't be a summary. It also makes the behaviour testable.)
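
    The check described above can be sketched as follows. This is a minimal illustration, not the code in the patch: the committer classes are empty stand-ins for the real Hadoop and Parquet types, and `validate` and `outcome` are hypothetical helper names.

```java
// Stand-in types only: these are NOT the real Hadoop/Parquet classes,
// just an inheritance chain with the same shape, so the sketch compiles alone.
class CommitterCheck {
    static class OutputCommitter {}
    static class FileOutputCommitter extends OutputCommitter {}
    static class ParquetOutputCommitter extends FileOutputCommitter {}
    // Named after the test committer in the description; here it is an empty stub.
    static class MarkingFileOutputCommitter extends FileOutputCommitter {}

    /**
     * Hypothetical helper. Rejects the one unsupported combination:
     * summary metadata requested while the committer is not a
     * ParquetOutputCommitter (or a subclass of it).
     */
    static void validate(Class<? extends OutputCommitter> committerClass,
                         boolean summaryEnabled) {
        boolean isParquetCommitter =
            ParquetOutputCommitter.class.isAssignableFrom(committerClass);
        if (summaryEnabled && !isParquetCommitter) {
            throw new RuntimeException(
                "Committer " + committerClass.getName()
                + " is not a ParquetOutputCommitter and cannot create job summaries."
                + " Set parquet.enable.summary-metadata to false.");
        }
    }

    /** Hypothetical helper: maps the check onto the outcomes used in the test table. */
    static String outcome(Class<? extends OutputCommitter> committerClass,
                          boolean summaryEnabled) {
        try {
            validate(committerClass, summaryEnabled);
            return "success";
        } catch (RuntimeException e) {
            return "exception";
        }
    }

    public static void main(String[] args) {
        // Walk the four committer/summary combinations.
        System.out.println(outcome(ParquetOutputCommitter.class, true));
        System.out.println(outcome(ParquetOutputCommitter.class, false));
        System.out.println(outcome(MarkingFileOutputCommitter.class, false));
        System.out.println(outcome(MarkingFileOutputCommitter.class, true));
    }
}
```

    Failing fast keeps the rule to a single condition: only the pairing of a non-Parquet committer with summary metadata enabled is rejected, and every other combination proceeds unchanged.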
    
    ## How was this patch tested?
    
    The patch includes a test suite, `ParquetCommitterSuite`, with a new 
committer, `MarkingFileOutputCommitter`, which extends `FileOutputCommitter` and 
writes a marker file in the destination directory. The presence of the marker 
file can be used to verify that the new committer was used. The tests then try 
the combinations of Parquet committer summary/no-summary and marking committer 
summary/no-summary.
    
    | committer | summary | outcome |
    |-----------|---------|---------|
    | parquet   | true    | success |
    | parquet   | false   | success |
    | marking   | false   | success with marker |
    | marking   | true    | exception |
    
    All tests are happy.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/steveloughran/spark cloud/SPARK-22217-committer

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19448.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19448
    
----
commit e6fdbdcf4118283abd22f7b14586ed742d225657
Author: Steve Loughran <ste...@hortonworks.com>
Date:   2017-07-12T10:42:51Z

    SPARK-22217 tuning ParquetOutputCommitter to support any committer class, 
provided saveSummaries is disabled. With Tests
    
    Change-Id: I19872dc1c095068ed5a61985d53cb7258bd9a9bb

----


