[
https://issues.apache.org/jira/browse/SPARK-18024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Reynold Xin updated SPARK-18024:
--------------------------------
Summary: Introduce an internal commit protocol API along with
OutputCommitter implementation (was: Introduce a commit protocol API along
with OutputCommitter implementation)
> Introduce an internal commit protocol API along with OutputCommitter
> implementation
> -----------------------------------------------------------------------------------
>
> Key: SPARK-18024
> URL: https://issues.apache.org/jira/browse/SPARK-18024
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Reynold Xin
> Assignee: Reynold Xin
> Fix For: 2.1.0
>
>
> This commit protocol API should wrap around Hadoop's output committer. Later
> we can expand the API to cover streaming commits.
> The existing Hadoop output committer API is insufficient for streaming use
> cases:
> 1. It has no way for tasks to pass information back to the driver.
> 2. It relies on Hadoop's string-keyed configuration map to pass information
> from the driver to the executors, largely because Hadoop MapReduce has no
> support for language integration and serialization. Spark can pass this
> information more naturally through automatic closure serialization.
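
To make the two points above concrete, here is a minimal sketch of what a commit protocol with a task-to-driver channel could look like. It is not Spark's or Hadoop's actual API: the names `TaskCommitMessage`, `SketchCommitProtocol`, `run_task`, `commit_job`, the `_staging` directory, and the file naming are all hypothetical, and tasks run in-process here rather than on executors. The idea it illustrates is that each task returns a message describing what it wrote, and the driver uses those messages to finalize the job.

```python
import os
import shutil
import tempfile
from dataclasses import dataclass


@dataclass
class TaskCommitMessage:
    """Hypothetical payload a task hands back to the driver when it commits."""
    task_id: int
    files: list


class SketchCommitProtocol:
    """Hypothetical commit protocol; an illustration, not Spark's API."""

    def __init__(self, output_dir):
        self.output_dir = output_dir
        self.staging_dir = os.path.join(output_dir, "_staging")

    def setup_job(self):
        # Driver side: create a staging area that readers should ignore.
        os.makedirs(self.staging_dir, exist_ok=True)

    def run_task(self, task_id, rows):
        # Task side: write into staging only, then report what was written.
        name = f"part-{task_id:05d}.txt"
        with open(os.path.join(self.staging_dir, name), "w") as f:
            for row in rows:
                f.write(f"{row}\n")
        return TaskCommitMessage(task_id=task_id, files=[name])

    def commit_job(self, messages):
        # Driver side: publish exactly the files the tasks reported.
        for message in messages:
            for name in message.files:
                os.replace(
                    os.path.join(self.staging_dir, name),
                    os.path.join(self.output_dir, name),
                )
        shutil.rmtree(self.staging_dir)

    def abort_job(self):
        # Driver side: discard all staged output.
        shutil.rmtree(self.staging_dir, ignore_errors=True)


def demo():
    output_dir = tempfile.mkdtemp()
    protocol = SketchCommitProtocol(output_dir)
    protocol.setup_job()
    # In a real cluster the protocol object would be serialized to executors;
    # here the tasks simply run in the same process.
    messages = [protocol.run_task(i, [f"row-{i}"]) for i in range(2)]
    protocol.commit_job(messages)
    return sorted(os.listdir(output_dir))
```

Calling `demo()` runs two tasks and commits them. Because the driver only moves files named in the returned messages, output from a task that never reported back stays in staging and is discarded with it.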