[ https://issues.apache.org/jira/browse/HIVE-14271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593020#comment-15593020 ]

Sahil Takiar commented on HIVE-14271:
-------------------------------------

[~cnauroth], we were actually thinking of implementing a "direct output 
committer" strategy for Hive (it would be optional, of course). Any chance you 
could expand on what the drawbacks of this approach would be?

For the issue reported in SPARK-10063, I think you should be able to add a 
config option so that the output file is only closed if the task was successful.
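
A minimal sketch of that guard, assuming a hypothetical config key 
{{hive.exec.s3.close-on-success}} and a {{taskSucceeded}} flag supplied by the 
task runtime (neither exists in Hive today; both would need to be wired into 
FileSinkOperator):

{code:java}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;

// Sketch only: "hive.exec.s3.close-on-success" is a hypothetical key and
// taskSucceeded a hypothetical flag passed in by the task runtime.
void closeOutput(Configuration conf, FSDataOutputStream out,
    boolean taskSucceeded) throws IOException {
  boolean closeOnSuccessOnly =
      conf.getBoolean("hive.exec.s3.close-on-success", false);
  if (!closeOnSuccessOnly || taskSucceeded) {
    // Closing the stream is what completes the S3 upload and makes the
    // object visible, so deferring close() to successful tasks avoids
    // publishing partial output (the failure mode in SPARK-10063).
    out.close();
  }
  // On failure the stream is left unclosed, so no object materializes.
}
{code}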

I know there are other concerns with things like speculative execution and task 
retries, but Hive may be able to overcome those by making sure every attempt of 
a given task writes to the same file on S3. Since S3 follows a last-writer-wins 
approach, and each task attempt is idempotent, there should be no data issues 
(a similar approach was taken in HIVE-1620).
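
To illustrate, the object name would be derived from the task ID only, never 
the attempt ID, so a retry or speculative attempt overwrites the same key. A 
sketch (the method and parameter names are illustrative, not existing Hive 
code):

{code:java}
import org.apache.hadoop.fs.Path;

// Name the output after the task ID alone (mirroring Hive's 000000_0
// naming convention), so attempt 0, a retry, and a speculative attempt
// all write the same S3 key. With last-writer-wins semantics and
// idempotent attempts, the surviving object is correct regardless of
// which attempt finishes last.
Path outputFileFor(Path outputDir, int taskId) {
  return new Path(outputDir, String.format("%06d_0", taskId));
}
{code}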

Thoughts?

> FileSinkOperator should not rename files to final paths when S3 is the 
> default destination
> ------------------------------------------------------------------------------------------
>
>                 Key: HIVE-14271
>                 URL: https://issues.apache.org/jira/browse/HIVE-14271
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Sergio Peña
>            Assignee: Sergio Peña
>
> FileSinkOperator does a rename of {{outPaths -> finalPaths}} when it finishes 
> writing all rows to a temporary path. The problem is that S3 does not support 
> renames; the Hadoop S3 filesystems emulate a rename with a copy followed by a 
> delete, which is slow for large files.
> Two options can be considered:
> a. Use a copy operation instead (sketched after this quote). After 
> FileSinkOperator writes all rows to outPaths, the commit method would do a 
> copy() call instead of a move().
> b. Write row by row directly to the S3 path (see HIVE-1620). This may perform 
> better, but we should take care of cleanup in case of write errors.
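
For reference, a minimal sketch of option (a) using the standard Hadoop 
{{FileUtil.copy}} API ({{outPath}} and {{finalPath}} are the names from the 
description above; the method itself is illustrative, not existing Hive code):

{code:java}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

// Commit by copying instead of renaming: deleteSource=true removes the
// temporary file once the copy lands, approximating move semantics
// without relying on rename, which S3 lacks.
void commitViaCopy(Configuration conf, Path outPath, Path finalPath)
    throws IOException {
  FileSystem srcFs = outPath.getFileSystem(conf);
  FileSystem dstFs = finalPath.getFileSystem(conf);
  FileUtil.copy(srcFs, outPath, dstFs, finalPath,
      true /* deleteSource */, conf);
}
{code}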


