Thanks for the reply, Noam. I have already tried the latter point of dividing the query, but the challenge comes at the join step itself.
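To make the split concrete, the shape I am testing is roughly the following (a sketch only; the intermediate table name huge_numeric_filtered is just a placeholder):

*FILTER FIRST, THEN JOIN (sketch):*
-- Stage the t1 filter into a small intermediate table first,
-- keeping only the columns the aggregates need.
CREATE TABLE huge_numeric_filtered STORED AS ORC AS
SELECT col1, col6, col11, col16
FROM huge_numeric_table_orc2
WHERE col1 > 34.11;

-- Then run the join against the pre-filtered side.
SELECT (avg(t1.col1)*avg(t1.col6))/(avg(t1.col11)*avg(t1.col16)) AS AVG5
FROM huge_numeric_filtered t1
LEFT OUTER JOIN huge_numeric_table_orc2 t2
ON t1.col1 = t2.col1
WHERE t2.col1 > 10.12;

Even with the filter staged out like this, it is the join that stalls.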
Thanks and Regards
Nishant Aggarwal, PMP
Cell No:- +91 99588 94305

On Thu, Aug 20, 2015 at 2:19 PM, Noam Hasson <noam.has...@kenshoo.com> wrote:
> Hi,
>
> Have you looked at the counters on the Hadoop side? It's possible you are
> dealing with a bad join which causes multiplication of items; if you see a
> huge number of input/output records in the map/reduce phase that keeps
> increasing, that's probably the case.
>
> Another thing I would try is to divide the job into several smaller
> queries, for example start with the filter only, after that the join, and
> so on.
>
> Noam.
>
> On Thu, Aug 20, 2015 at 10:55 AM, Nishant Aggarwal <nishant....@gmail.com> wrote:
>
>> Dear Hive Users,
>>
>> I am in the process of running a POC for one of my customers,
>> demonstrating the huge performance benefits of Hadoop BigData using Hive.
>>
>> Following is the problem statement I am stuck with.
>>
>> I have generated a large table with 28 columns (all doubles). The table
>> size on disk is 70GB (I ultimately created a compressed table using the
>> ORC format to save disk space, bringing the table size down to < 1GB)
>> with more than 450 million records.
>>
>> In order to demonstrate a complex use case, I joined this table with
>> itself. Following are the queries I used to create the table and the
>> join query I am running.
>>
>> *Create Table and Loading Data, Hive parameter settings:*
>> set hive.vectorized.execution.enabled = true;
>> set hive.vectorized.execution.reduce.enabled = true;
>> set mapred.max.split.size=100000000;
>> set mapred.min.split.size=1000000;
>> set hive.auto.convert.join=false;
>> set hive.enforce.sorting=true;
>> set hive.enforce.bucketing=true;
>> set hive.exec.dynamic.partition=true;
>> set hive.exec.dynamic.partition.mode=nonstrict;
>> set mapreduce.reduce.input.limit=-1;
>> set hive.exec.parallel = true;
>>
>> CREATE TABLE huge_numeric_table_orc2(col1 double, col2 double, col3 double,
>> col4 double, col5 double, col6 double, col7 double, col8 double,
>> col9 double, col10 double, col11 double, col12 double, col13 double,
>> col14 double, col15 double, col16 double, col17 double, col18 double,
>> col19 double, col20 double, col21 double, col22 double, col23 double,
>> col24 double, col25 double, col26 double, col27 double, col28 double)
>> clustered by (col1) sorted by (col1) into 240 buckets
>> STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
>>
>> from huge_numeric_table insert overwrite table huge_numeric_table_orc2
>> select * sort by col1;
>>
>> *JOIN QUERY:*
>>
>> select (avg(t1.col1)*avg(t1.col6))/(avg(t1.col11)*avg(t1.col16)) as AVG5
>> from huge_numeric_table_orc2 t1 left outer join huge_numeric_table_orc2 t2
>> on t1.col1=t2.col1 where (t1.col1) > 34.11 and (t2.col1) > 10.12;
>>
>> *The problem is that this query gets stuck at 80-85% of the reducers; it
>> goes into a loop and never finishes.*
>>
>> The Hive version is 1.2.
>>
>> Please help.
>>
>> Thanks and Regards
>> Nishant Aggarwal, PMP
>> Cell No:- +91 99588 94305
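PS: Following your point about join multiplication, I will also check how
skewed the join key is; if many rows share the same value of col1, the
self-join will multiply them. A sketch of the check (it uses nothing beyond
the table already described above):

-- Count duplicate join keys; large counts here would explain the
-- growing record input/output in the map/reduce counters.
SELECT col1, COUNT(*) AS cnt
FROM huge_numeric_table_orc2
WHERE col1 > 34.11
GROUP BY col1
HAVING COUNT(*) > 1
ORDER BY cnt DESC
LIMIT 20;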