>From the text below it seems like you are not collecting table level stats. You can collect table level stats by following analyze table <table name> compute statistics;
Thanks John On 8/24/15, 6:24 PM, "Raajay" <raaja...@gmail.com> wrote: >Hi John, > >I am on Hive-2.0.0. I forked of the Hive master branch 2 weeks back >(commit id: 763cb02b5eafb0ecd3fd0eb512636a1b092df671). > >I actually have "analyze" before I execute the query. I left it out for >brevity. Please find the entire query (sent to hive in a file) below. >Without the analyze commands, I find that CBO optimization is ignored as >expected. Perhaps I am missing some configuration. > >I print out the calcite optimized plans, using the "RelOptUtil.toString()" >helper on "calciteOptimizedPlan" at the end of "apply" function in >CalcitePlannerAction. > > >- Raajay > > > >Query >===== > >-- Set the hive configuration > >-- clear out the existings tables >DROP TABLE tableA; >DROP TABLE tableB; >DROP TABLE tableC; >DROP TABLE output_tab; > >-- create the tables and load the data >create external table tableA (a_day int, a_product string, a_alternate >string, a_sales int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED >AS TEXTFILE; >LOAD DATA LOCAL INPATH >'${hiveconf:CODE_DIR}/data/test/tableA_sc1000_nd7.txt' OVERWRITE INTO >TABLE >tableA; > >create external table tableB (b_day int, b_product string, b_alternate >string, b_sales int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED >AS TEXTFILE; >LOAD DATA LOCAL INPATH >'${hiveconf:CODE_DIR}/data/test/tableB_sc1000_nd7.txt' OVERWRITE INTO >TABLE >tableB; > >create external table tableC (c_day int, c_product string, c_alternate >string, c_sales int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED >AS TEXTFILE; >LOAD DATA LOCAL INPATH >'${hiveconf:CODE_DIR}/data/test/tableC_sc1000_nd7.txt' OVERWRITE INTO >TABLE >tableC; > >-- compute statistics to be used by calcite CBO >analyze table tableA compute statistics for columns; >analyze table tableB compute statistics for columns; >analyze table tableC compute statistics for columns; > >-- create output tables >create table output_tab (a_day int, a_product string, a_alternate string, >total_sales int); > >-- the query >insert overwrite table output_tab >select >a_day, a_product, b_alternate, (a_sales + b_sales + c_sales) as >total_sales >from >tableA a join tableB b >on a.a_day = b.b_day and a.a_product = b.b_product >join tableC c >on b.b_day = c.c_day and b.b_alternate = c.c_alternate; > > >On Mon, Aug 24, 2015 at 7:25 PM, John Pullokkaran < >jpullokka...@hortonworks.com> wrote: > >> In addition to col stats you also need table stats. >> >> From: John Pullokkaran <jpullokka...@hortonworks.com> >> Reply-To: "u...@hive.apache.org" <u...@hive.apache.org> >> Date: Monday, August 24, 2015 at 5:23 PM >> To: "u...@hive.apache.org" <u...@hive.apache.org> >> Cc: "dev@hive.apache.org" <dev@hive.apache.org> >> Subject: Re: CBO - get cost of the plan >> >> Raajay, >> >> You don¹t have col stats hence it assumes 1 for row count. >> What version of Hive are you on? >> >> Thanks >> John >> >> From: Raajay <raaja...@gmail.com> >> Reply-To: "u...@hive.apache.org" <u...@hive.apache.org> >> Date: Monday, August 24, 2015 at 5:19 PM >> To: "u...@hive.apache.org" <u...@hive.apache.org> >> Cc: "dev@hive.apache.org" <dev@hive.apache.org> >> Subject: CBO - get cost of the plan >> >> Hello, >> >> I am interested to get the cost of the query plans as calculated by the >> CBO. How can I get that information ? For example, consider a query >>with a >> three way join of the following form: >> >> Query >> ===== >> >> insert overwrite table output_tab >> select >> a_day, a_product, b_alternate, (a_sales + b_sales + c_sales) as >>total_sales >> from >> tableA a join tableB b >> on a.a_day = b.b_day and a.a_product = b.b_product >> join tableC c >> on b.b_day = c.c_day and b.b_alternate = c.c_alternate; >> >> >> The number of rows for tableA, tableB, and tableC are of the order of >> 10000. I believe, that by "analyzing columns" of all the tables Hive >>will >> have statistics regarding the number of rows, distinct values, etc. >> However, when I try to print out the operator tree as determined by the >> CalcitePlanner, I get the following output. >> >> Print out of the Operator Tree >> ====================== >> >> HiveProject(a_day=[$4], a_product=[$5], b_alternate=[$2], >> total_sales=[+(+($6, $3), $9)]): rowcount =* 1.0*, cumulative cost = >>{4.0 >> rows, 0.0 cpu, 0.0 io}, id = 150 >> HiveJoin(condition=[AND(=($0, $7), =($2, $8))], joinType=[inner], >> algorithm=[none], cost=[{2.0 rows, 0.0 cpu, 0.0 io}]): rowcount = 1.0, >> cumulative cost = {4.0 rows, 0.0 cpu, 0.0 io}, id = 148 >> HiveJoin(condition=[AND(=($4, $0), =($5, $1))], joinType=[inner], >> algorithm=[none], cost=[{2.0 rows, 0.0 cpu, 0.0 io}]): rowcount = 1.0, >> cumulative cost = {2.0 rows, 0.0 cpu, 0.0 io}, id = 143 >> HiveProject(b_day=[$0], b_product=[$1], b_alternate=[$2], >> b_sales=[$3]): rowcount = 1.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 >> io}, id = 138 >> HiveTableScan(table=[[default.tableb]]): rowcount = 1.0, >> cumulative cost = {0}, id = 44 >> HiveProject(a_day=[$0], a_product=[$1], a_sales=[$3]): rowcount = >> 1.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 141 >> HiveTableScan(table=[[default.tablea]]): rowcount = 1.0, >> cumulative cost = {0}, id = 42 >> HiveProject(c_day=[$0], c_alternate=[$2], c_sales=[$3]): rowcount = >> 1.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 146 >> HiveTableScan(table=[[default.tablec]]): rowcount = 1.0, >>cumulative >> cost = {0}, id = 47 >> >> >> The number of rows as displayed here is 1.0, which is clearly not the >> correct value. >> >> - Raajay. >> >> >>