Sahil Takiar commented on HIVE-1620:

Hey [~richcole], [~vaggarw], [~yalovyyi], [~poeppt],

As part of HIVE-14271 and HIVE-14269, we are considering implementing something 
very similar to what this patch did. However, we are still debating between a 
few different options. Could someone comment on whether this approach 
worked well in production? Were there issues with this approach that caused 
problems for any users?

Our main concern with the Direct Write to S3 approach is that the failure 
semantics need to be improved. Hive needs to make sure that no dangling files 
are left in the final table location on S3. This isn't really an issue when 
writing to HDFS, because everything is written to a temp directory and only 
the successfully written files get renamed to their output location. The temp 
directory is then deleted at the end of the MR job (similar concerns were 
raised in HIVE-14271).
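
To make the concern concrete, here is a rough sketch of the staging-and-rename 
pattern described above. It is not Hive's actual code: it uses the local 
filesystem as a stand-in for HDFS, and the function name write_with_staging and 
the shape of the tasks argument are made up for illustration. The point is that 
output from a failed task never reaches the final location, which is the 
guarantee a direct write would have to provide some other way.

```python
import os
import shutil
import tempfile


def write_with_staging(final_dir, tasks):
    """Run each task against a staging dir and publish only successful outputs.

    `tasks` maps an output file name to a callable that writes to a given path
    (a hypothetical stand-in for an MR task). Returns the sorted list of file
    names that were published to `final_dir`.
    """
    os.makedirs(final_dir, exist_ok=True)
    # Keep staging next to final_dir so the rename stays on one filesystem.
    parent = os.path.dirname(os.path.abspath(final_dir))
    staging_dir = tempfile.mkdtemp(prefix=".staging-", dir=parent)
    published = []
    try:
        for name, write_fn in tasks.items():
            tmp_path = os.path.join(staging_dir, name)
            try:
                write_fn(tmp_path)
            except Exception:
                # Failed task: whatever it wrote stays in staging only.
                continue
            # Publish the finished file by renaming it into place.
            os.replace(tmp_path, os.path.join(final_dir, name))
            published.append(name)
    finally:
        # Deleting the staging dir discards leftovers from failed tasks.
        shutil.rmtree(staging_dir, ignore_errors=True)
    return sorted(published)
```

With a direct write there is no staging directory to throw away, so the 
cleanup of failed output would have to be handled explicitly.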

According to the AWS docs, EMR 4.x took the Direct Write approach, but EMR 5.x 
doesn't (ref: 
 The docs say that the Direct Write to S3 was eliminated and that EMR 5.x 
writes to a staging file on S3, and then copies the data to the final table 
location on S3. Any chance someone could comment on why the approach was 
changed? Were there fundamental issues with the approach that caused it to not 
work well in production?

Any help / feedback on this would be greatly appreciated, since we probably 
shouldn't implement the Direct Write Approach if it doesn't work well.

> Patch to write directly to S3 from Hive
> ---------------------------------------
>                 Key: HIVE-1620
>                 URL: https://issues.apache.org/jira/browse/HIVE-1620
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Vaibhav Aggarwal
>            Assignee: Vaibhav Aggarwal
>         Attachments: HIVE-1620.patch
> We want to submit a patch to Hive which allows users to write files directly 
> to S3.
> This patch allows users to specify an S3 location as the table output location 
> and hence eliminates the need to copy data from HDFS to S3.
> Users can run Hive queries directly over the data stored in S3.
> This patch helps integrate Hive with S3 better and quicker.
