Re: Hive 2.1.0 - writes multiple copies of same data to temp location

Sergey Shelukhin Fri, 09 Dec 2016 10:03:12 -0800

We are addressing this in HIVE-14535, which eliminates all of the copies.
Unfortunately, it’s a major change and won’t be finished until Jan-Feb.
I think there’s a more specific change somewhere that may eliminate one of
the copies. IIRC it may ship in 2.2?


On 16/12/8, 23:52, "Palanieppan Muthiah" <[email protected]> wrote:

>Hi,
>
>I am using Hive 2.1.0 on Amazon EMR. In my 'insert overwrite' job, whose
>source and target tables are both in S3, I notice that 2 copies of the
>result are created in the temp directory on S3.
>
>First the output of the query is written to a temp directory (e.g.
>ext-10000) in S3 by the MR job. Then the MR job completes, but the Hive
>client still doesn't terminate. Instead I see that the entire temp
>directory is copied in S3 again, into another directory (e.g.
>tmp-ext-10000), file by file.
>
>Is this a known issue? In my case, my query reads about 0.5 terabytes of
>data, performs aggregation and writes back to S3. The second copy is very
>slow and usually fails with a NoHttpResponseException from S3.
>
>Let me know if this is a known issue, if there are workarounds, or if
>there are config options to avoid the 2 copies.
>
>
>Thanks,
>pala
