Yes, this is a known issue. We are addressing it in HIVE-14535, which eliminates all of the copies. Unfortunately, that is a major change and it won't be finished until Jan-Feb. I think there's also a more specific change somewhere that may eliminate one of the copies. IIRC it may ship in 2.2?
On 16/12/8, 23:52, "Palanieppan Muthiah" <[email protected]> wrote:

>Hi,
>
>I am using Hive 2.1.0 on Amazon EMR. In my 'insert overwrite' job, whose
>source and target tables are both in S3, I notice that 2 copies of the
>result are created in the temp directory on S3.
>
>First the output of the query is written to a temp directory (e.g.
>ext-10000) in S3 by the MR job. Then the MR job completes, but the Hive
>client still doesn't terminate. Instead I see that the entire temp
>directory is copied in S3 again, into another directory (e.g.
>tmp-ext-10000), file by file.
>
>Is this a known issue? In my case, my query reads about 0.5 terabyte of
>data, performs aggregation and writes back to S3. The second copy is so
>slow and usually fails with NoHttpResponseException from S3.
>
>Let me know if this is a known issue, if there are workarounds, or if
>there are config options to avoid 2 copies.
>
>
>Thanks,
>pala
