Thanks for the reply, Noam. I have already tried the latter point of
dividing the query, but the challenge comes when joining the tables.


Thanks and Regards
Nishant Aggarwal, PMP
Cell No:- +91 99588 94305


On Thu, Aug 20, 2015 at 2:19 PM, Noam Hasson <noam.has...@kenshoo.com>
wrote:

> Hi,
>
> Have you looked at the counters on the Hadoop side? It's possible you are
> dealing with a bad join that multiplies records: if you see a huge number
> of record inputs/outputs in the map/reduce phases, and the numbers keep
> increasing, that's probably the case.
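>
> A quick, illustrative way to check for that multiplication directly is to
> count how many rows share each join key. If a key matches n rows on each
> side of the self-join, it produces n*n output rows for that key alone:
>
> -- illustrative diagnostic against your table and join column
> SELECT col1, COUNT(*) AS cnt
> FROM huge_numeric_table_orc2
> GROUP BY col1
> ORDER BY cnt DESC
> LIMIT 20;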
>
> Another thing I would try is to divide the job into several smaller
> queries: for example, start with the filter only, then the join, and so
> on.
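>
> A rough sketch of that staging (not an exact rewrite of your query; the
> intermediate table name is made up, and I pruned it to the columns your
> aggregates actually use):
>
> -- step 1: filter only, materialized as a small ORC table
> CREATE TABLE hnt_filtered STORED AS ORC AS
> SELECT col1, col6, col11, col16
> FROM huge_numeric_table_orc2
> WHERE col1 > 34.11;
>
> -- step 2: run the join against the much smaller intermediate table
> SELECT (avg(t1.col1)*avg(t1.col6))/(avg(t1.col11)*avg(t1.col16)) AS AVG5
> FROM hnt_filtered t1
> JOIN hnt_filtered t2 ON t1.col1 = t2.col1;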
>
> Noam.
>
> On Thu, Aug 20, 2015 at 10:55 AM, Nishant Aggarwal <nishant....@gmail.com>
> wrote:
>
>> Dear Hive Users,
>>
>> I am in the process of running a PoC for one of my customers,
>> demonstrating the huge performance benefits of Hadoop/Big Data using Hive.
>>
>> Following is the problem statement I am stuck with.
>>
>> I have generated a large table with 28 columns (all doubles) and more
>> than 450 million records. The table size on disk is 70GB; I ultimately
>> created a compressed table using the ORC format to save disk space,
>> bringing the table size down to < 1GB.
>>
>> In order to demonstrate a complex use case, I joined this table with
>> itself. Following are the queries I used to create the table and the
>> join query I am running.
>>
>> *Create Table and Loading Data, Hive parameter settings:*
>> -- enable vectorized execution in both map and reduce stages
>> set hive.vectorized.execution.enabled = true;
>> set hive.vectorized.execution.reduce.enabled = true;
>> -- split sizes control the number of mappers
>> set mapred.max.split.size=100000000;
>> set mapred.min.split.size=1000000;
>> -- force a reduce-side (common) join; no automatic map-join conversion
>> set hive.auto.convert.join=false;
>> -- make inserts honor the table's SORTED BY / CLUSTERED BY clauses
>> set hive.enforce.sorting=true;
>> set hive.enforce.bucketing=true;
>> -- allow fully dynamic partitioning
>> set hive.exec.dynamic.partition=true;
>> set hive.exec.dynamic.partition.mode=nonstrict;
>> -- disable the per-reducer input size safety limit
>> set mapreduce.reduce.input.limit=-1;
>> -- run independent stages in parallel
>> set hive.exec.parallel = true;
>>
>> CREATE TABLE huge_numeric_table_orc2(col1 double,col2 double,col3
>> double,col4 double,col5 double,col6 double,col7 double,col8 double,col9
>> double,col10 double,col11 double,col12 double,col13 double,col14
>> double,col15 double,col16 double,col17 double,col18 double,col19
>> double,col20 double,col21 double,col22 double,col23 double,col24
>> double,col25 double,col26 double,col27 double,col28 double)
>> clustered by (col1) sorted by (col1) into 240 buckets
>> STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
>>
>> from huge_numeric_table insert overwrite table huge_numeric_table_orc2
>> select * sort by col1;
>>
>>
>> *JOIN QUERY:*
>>
>> select (avg(t1.col1)*avg(t1.col6))/(avg(t1.col11)*avg(t1.col16)) as AVG5
>> from huge_numeric_table_orc2 t1
>> left outer join huge_numeric_table_orc2 t2 on t1.col1 = t2.col1
>> where t1.col1 > 34.11 and t2.col1 > 10.12;
>>
>>
>> *The problem is that this query gets stuck at 80-85% of the reduce
>> phase; it appears to loop and never finishes.*
>>
>> The Hive version is 1.2.
>>
>> Please help.
>>
>>
>> Thanks and Regards
>> Nishant Aggarwal, PMP
>> Cell No:- +91 99588 94305
>>
>>
>
