Sahil Takiar commented on HIVE-1620:
Hey [~richcole], [~vaggarw], [~yalovyyi], [~poeppt],
As part of HIVE-14271 and HIVE-14269, we are considering implementing something
very similar to what this patch did. However, we are still debating between a
few different options. Any chance someone could comment on whether this approach
worked well in production? Were there issues with this approach that caused
problems for any users?
One concern we have with the Direct Write to S3 from Hive is that the failure
semantics when writing to S3 need to be improved. Hive needs to make
sure that there aren’t any dangling files left in the final table location on
S3. This isn’t really an issue for writing to HDFS because everything is
written to a temp directory and only the successfully written files get renamed
to their output location. The temp directory is then deleted at the end of the
MR job (similar concerns were raised in HIVE-14271).
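To make the HDFS-style commit described above concrete, here is a minimal sketch of the write-to-temp-then-rename pattern. It is not Hive's actual code: it uses java.nio on a local filesystem as a stand-in for HDFS, and the class name StagedCommitSketch, the "_tmp." prefix, the "part-N" file names, and the convention that a null payload simulates a failed task are all assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class StagedCommitSketch {

    /**
     * Writes every task's output into a temp directory next to finalDir,
     * renames only the successfully written files into finalDir, and always
     * deletes the temp directory at the end. Returns the number of files committed.
     */
    static int runJob(Path finalDir, List<String> taskOutputs) throws IOException {
        // Hypothetical naming: a sibling directory, so the rename stays on one filesystem.
        Path tmpDir = finalDir.resolveSibling("_tmp." + finalDir.getFileName());
        Files.createDirectories(finalDir);
        Files.createDirectories(tmpDir);
        int committed = 0;
        try {
            for (int i = 0; i < taskOutputs.size(); i++) {
                Path tmpFile = tmpDir.resolve("part-" + i);
                try {
                    writeTask(tmpFile, taskOutputs.get(i));
                } catch (IOException e) {
                    // Failed task: its partial file stays in tmpDir and is never renamed.
                    continue;
                }
                Files.move(tmpFile, finalDir.resolve(tmpFile.getFileName()),
                        StandardCopyOption.ATOMIC_MOVE);
                committed++;
            }
        } finally {
            // Cleanup removes any partial output, so nothing dangling reaches finalDir.
            deleteRecursively(tmpDir);
        }
        return committed;
    }

    /** Simulated task: a null payload writes a partial file and then fails. */
    static void writeTask(Path file, String data) throws IOException {
        if (data == null) {
            Files.write(file, "partial".getBytes(StandardCharsets.UTF_8));
            throw new IOException("simulated task failure");
        }
        Files.write(file, data.getBytes(StandardCharsets.UTF_8));
    }

    /** Deletes dir and everything under it, children before parents. */
    static void deleteRecursively(Path dir) throws IOException {
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(dir)) {
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("commit-sketch");
        Path table = root.resolve("table");
        int committed = runJob(table, Arrays.asList("row-a", null, "row-c"));
        System.out.println("committed files: " + committed);
    }
}
```

The sketch leans on the rename being cheap and atomic. S3 has no native rename (a "rename" is a copy followed by a delete), so the same guarantee does not carry over for free, which is the gap the failure-semantics concern above is about.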
According to the AWS docs, EMR 4.x took the Direct Write approach, but EMR 5.x
eliminated it. The docs say that EMR 5.x instead writes to a staging file on S3
and then copies the data to the final table location on S3. Any chance someone
could comment on why the approach was changed? Were there fundamental issues
with the approach that caused it to not work well in production?
Any help / feedback on this would be greatly appreciated, since we probably
shouldn't implement the Direct Write Approach if it doesn't work well.
> Patch to write directly to S3 from Hive
> Key: HIVE-1620
> URL: https://issues.apache.org/jira/browse/HIVE-1620
> Project: Hive
> Issue Type: New Feature
> Reporter: Vaibhav Aggarwal
> Assignee: Vaibhav Aggarwal
> Attachments: HIVE-1620.patch
> We want to submit a patch to Hive which allows users to write files directly
> to S3.
> This patch allows users to specify an S3 location as the table output location
> and hence eliminates the need to copy data from HDFS to S3.
> Users can run Hive queries directly over the data stored in S3.
> This patch helps integrate Hive with S3 better and quicker.