GitHub user marmbrus opened a pull request:

    https://github.com/apache/spark/pull/11897

    [SPARK-14078] Streaming Parquet Based FileSink

    This PR adds a new `Sink` implementation that writes out Parquet files. To 
handle partial failures correctly while maintaining exactly-once semantics, the 
files for each batch are written to a unique directory and then atomically 
appended to a metadata log. When a Parquet-based `DataSource` is initialized 
for reading, we first check for this log directory and, when it is present, use 
it instead of a file listing.
    
    Unit tests are added, as well as a stress test that checks the answer after 
non-deterministically injected failures.
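
    The write-then-commit pattern described above can be sketched independently 
of Spark. The following Python sketch is illustrative only: it does not use 
Spark's actual `Sink` API, and names such as `FileSinkLog` and `write_batch` 
are hypothetical. It shows the core idea that the atomic commit point is the 
appearance of a log entry, so readers never see partially written batches.

    ```python
    import json
    import os
    import tempfile

    class FileSinkLog:
        """Toy metadata log: one JSON file per committed batch.

        Creating the per-batch log entry is the atomic commit point; a
        crash mid-write leaves only an uncommitted data directory behind,
        which readers ignore because it is absent from the log.
        """

        def __init__(self, root):
            self.root = root
            self.log_dir = os.path.join(root, "_metadata_log")
            os.makedirs(self.log_dir, exist_ok=True)

        def write_batch(self, batch_id, rows):
            # 1. Write the batch's data to a unique, batch-specific directory.
            data_dir = os.path.join(self.root, f"batch-{batch_id}")
            os.makedirs(data_dir, exist_ok=True)
            data_file = os.path.join(data_dir, "part-0.json")
            with open(data_file, "w") as f:
                json.dump(rows, f)

            # 2. Commit atomically: write the log entry to a temp file,
            #    then rename it into place (rename is atomic on POSIX).
            entry = {"batch": batch_id, "files": [data_file]}
            fd, tmp = tempfile.mkstemp(dir=self.log_dir, suffix=".tmp")
            with os.fdopen(fd, "w") as f:
                json.dump(entry, f)
            os.rename(tmp, os.path.join(self.log_dir, str(batch_id)))

        def committed_files(self):
            # Readers consult the log rather than listing data directories,
            # so files from failed or in-flight batches are invisible.
            files = []
            for name in sorted(os.listdir(self.log_dir), key=int):
                with open(os.path.join(self.log_dir, name)) as f:
                    files.extend(json.load(f)["files"])
            return files
    ```

    Because a re-run of a failed batch writes to a fresh directory and only 
one log entry per batch id can win the rename, replays cannot produce 
duplicate visible data, which is what gives the exactly-once guarantee.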

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/marmbrus/spark fileSink

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11897.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #11897
    
----
commit e861a6837382eaddd2dbd85d01cbc99c0bcd06fa
Author: Michael Armbrust <[email protected]>
Date:   2016-03-22T05:31:21Z

    WIP

commit 8e8e4c60a8b6a2289de5867767e1d8ebd21d32ba
Author: Michael Armbrust <[email protected]>
Date:   2016-03-22T18:26:13Z

    cleanup

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
