[ https://issues.apache.org/jira/browse/HIVE-13321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15205023#comment-15205023 ]
Sanjay Radia commented on HIVE-13321:
-------------------------------------

bq. A common pattern in MapReduce and Hive is to write all output into a temporary folder and then rename this temporary folder to match the final output location. When using some of the newer FileSystems with Hive, the performance can be improved by directly writing output and avoiding the temporary folder write & rename.

Note: the temp folder was necessary to deal with failures and also with multiple attempts. Renames in traditional filesystems are very low cost and involve no copy of data unless they cross volumes. In the case of MapReduce, the tmp folder is a subdirectory of the output folder, so the rename never crosses volumes. In cloud object stores (like S3), a rename requires a data copy (hence HADOOP-9565's proposal to add a server-side copy - but that is still an extra copy, which is what you are trying to avoid in this JIRA).

Optimization for cloud storage makes a lot of sense, but one has to deal with the failure case and with multiple attempts/speculative execution; the output directory cannot be left in a mess. Could you please elaborate on how you plan to deal with failures?

> Add support for different output strategies
> -------------------------------------------
>
>                 Key: HIVE-13321
>                 URL: https://issues.apache.org/jira/browse/HIVE-13321
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Rob Leidle
>
> The Hadoop ecosystem has expanded to support a wider variety of data-stores
> and filesystems than simply HDFS. These FileSystems have different write
> atomicity and read consistency guarantees. There are enhancements we can
> make to Hive to ensure Hive works even better with a wider variety of
> FileSystems in the Hadoop ecosystem. We can see work going on in the Hadoop
> project to robustly support these FileSystems. One such example is
> HADOOP-9565 where the behavior of MapReduce output is enhanced to do what is
> optimal for different FileSystems.
>
> A common pattern in MapReduce and Hive is to write all output into a
> temporary folder and then rename this temporary folder to match the final
> output location. When using some of the newer FileSystems with Hive, the
> performance can be improved by directly writing output and avoiding the
> temporary folder write & rename.
>
> The proposal is to enhance Hive to support different strategies for file
> output. One such strategy would be a concept named “DirectWrite”. DirectWrite
> will be optionally enabled, likely on a per-FileSystem basis. When
> DirectWrite is enabled, all Hive job output will be written directly to the
> output location.
>
> This is an umbrella JIRA for all the tasks related to this functionality.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
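
To make the failure-handling concern in the comment concrete, here is a minimal sketch of the "write to a temporary folder, then rename" pattern on a local filesystem. It is an illustration only and uses neither the Hadoop FileSystem API nor any Hive code; the function names and the `_temporary`, `attempt_N` and `part-00000` names are assumptions chosen for the example. Each attempt writes into its own subdirectory of the output folder, so a failed attempt never leaves a partial file where readers look, and the commit is a same-volume rename.

```python
import os
import shutil
import tempfile


def run_attempt(output_dir, attempt_id, records, fail=False):
    """Write one task attempt into a private temp dir under output_dir."""
    # Hypothetical layout: <output>/_temporary/<attempt_id>/part-00000
    attempt_dir = os.path.join(output_dir, "_temporary", attempt_id)
    os.makedirs(attempt_dir)
    with open(os.path.join(attempt_dir, "part-00000"), "w") as f:
        for i, rec in enumerate(records):
            if fail and i == 1:
                raise RuntimeError("simulated task failure")
            f.write(rec + "\n")
    return attempt_dir


def commit(output_dir, attempt_dir):
    """Promote one attempt's files by rename, then discard all temp data."""
    for name in os.listdir(attempt_dir):
        # Same volume, so the rename moves no data on a traditional filesystem.
        os.rename(os.path.join(attempt_dir, name),
                  os.path.join(output_dir, name))
    shutil.rmtree(os.path.join(output_dir, "_temporary"))


def visible_files(output_dir):
    """Files a reader would see, ignoring the hidden temp subtree."""
    return sorted(n for n in os.listdir(output_dir) if not n.startswith("_"))


def demo():
    out = tempfile.mkdtemp()
    try:
        run_attempt(out, "attempt_0", ["a", "b", "c"], fail=True)
    except RuntimeError:
        pass  # the partial file stays hidden under _temporary
    before = visible_files(out)
    good = run_attempt(out, "attempt_1", ["a", "b", "c"])
    commit(out, good)
    after = visible_files(out)
    with open(os.path.join(out, "part-00000")) as f:
        content = f.read()
    shutil.rmtree(out)
    return before, after, content
```

On an object store where a rename is implemented as a copy, the `commit` step would cost time proportional to the data written. Writing directly to the output location removes that step, but then the partial file from the failed attempt would be visible to readers unless some other cleanup or commit mechanism is in place, which is the question the comment raises.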