[ 
https://issues.apache.org/jira/browse/HIVE-26655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17622800#comment-17622800
 ] 

Sungwoo Park commented on HIVE-26655:
-------------------------------------

I tested again with the latest release in the master branch (Hive on Tez), 
using 100GB TPC-DS benchmark.

 

1. Before computing statistics with ANALYZE TABLE

With hive.tez.bloom.filter.merge.threads=0, query 17 returns correct results 
(55 rows) consistently.

With hive.tez.bloom.filter.merge.threads=1, query 17 returns wrong results 
inconsistently (7 rows, 18 rows, 17 rows, and so on).

 

2. After computing statistics with ANALYZE TABLE

Query 17 returns correct results (55 rows) whether 
hive.tez.bloom.filter.merge.threads is set to 0 or 1.

 

Here is an example of running EXPLAIN ANALYZE.

 
{code:java}
0: jdbc:hive2://blue0:9852/> set hive.tez.bloom.filter.merge.threads=1;
No rows affected (0.005 seconds)
0: jdbc:hive2://blue0:9852/> EXPLAIN ANALYZE select  i_item_id ,i_item_desc 
,s_state ,count(ss_quantity) as store_sales_quantitycount ,avg(ss_quantity) as 
store_sales_quantityave ,stddev_samp(ss_quantity) as store_sales_quantitystdev 
,stddev_samp(ss_quantity)/avg(ss_quantity) as store_sales_quantitycov 
,count(sr_return_quantity) as_store_returns_quantitycount 
,avg(sr_return_quantity) as_store_returns_quantityave 
,stddev_samp(sr_return_quantity) as_store_returns_quantitystdev 
,stddev_samp(sr_return_quantity)/avg(sr_return_quantity) as 
store_returns_quantitycov ,count(cs_quantity) as catalog_sales_quantitycount 
,avg(cs_quantity) as catalog_sales_quantityave 
,stddev_samp(cs_quantity)/avg(cs_quantity) as catalog_sales_quantitystdev 
,stddev_samp(cs_quantity)/avg(cs_quantity) as catalog_sales_quantitycov from 
store_sales ,store_returns ,catalog_sales ,date_dim d1 ,date_dim d2 ,date_dim 
d3 ,store ,item where d1.d_quarter_name = '2000Q1' and d1.d_date_sk = 
ss_sold_date_sk and i_item_sk = ss_item_sk and s_store_sk = ss_store_sk and 
ss_customer_sk = sr_customer_sk and ss_item_sk = sr_item_sk and 
ss_ticket_number = sr_ticket_number and sr_returned_date_sk = d2.d_date_sk and 
d2.d_quarter_name in ('2000Q1','2000Q2','2000Q3') and sr_customer_sk = 
cs_bill_customer_sk and sr_item_sk = cs_item_sk and cs_sold_date_sk = 
d3.d_date_sk and d3.d_quarter_name in ('2000Q1','2000Q2','2000Q3') group by 
i_item_id ,i_item_desc ,s_state order by i_item_id ,i_item_desc ,s_state limit 
100;
+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| Plan optimized by CBO.                             |
|                                                    |
| Vertex dependency in root stage                    |
| Map 1 <- Map 10 (BROADCAST_EDGE), Map 4 (BROADCAST_EDGE), Map 5 
(BROADCAST_EDGE), Map 7 (BROADCAST_EDGE), Reducer 9 (BROADCAST_EDGE) |
| Map 7 <- Map 4 (BROADCAST_EDGE), Map 8 (BROADCAST_EDGE), Reducer 6 
(BROADCAST_EDGE) |
| Map 8 <- Map 4 (BROADCAST_EDGE)                    |
| Reducer 2 <- Map 1 (SIMPLE_EDGE)                   |
| Reducer 3 <- Reducer 2 (SIMPLE_EDGE)               |
| Reducer 6 <- Map 5 (CUSTOM_SIMPLE_EDGE)            |
| Reducer 9 <- Map 8 (CUSTOM_SIMPLE_EDGE)            |
|                                                    | 
| Stage-0                                            |
|   Fetch Operator                                   | 
|     limit:100                                      |
|     Stage-1                                        |
|       Reducer 3 vectorized                         |
|       File Output Operator [FS_294]                |
|         Limit [LIM_293] (rows=100/13 width=466)    |
|           Number of rows:100                       |
|           Select Operator [SEL_292] (rows=15835/13 width=466) |
|             
Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11","_col12","_col13","_col14"]
 |
|           <-Reducer 2 [SIMPLE_EDGE] vectorized     |
|             SHUFFLE [RS_291]                       |
|               Select Operator [SEL_290] (rows=15835/13 width=458) |
|                 
Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11","_col12","_col13"]
 |
|                 Group By Operator [GBY_289] (rows=15835/13 width=466) |
|                   
Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11","_col12","_col13","_col14"],aggregations:["count(VALUE._col0)","sum(VALUE._col1)","sum(VALUE._col2)","sum(VALUE._col3)","count(VALUE._col4)","sum(VALUE._col5)","sum(VALUE._col6)","sum(VALUE._col7)","count(VALUE._col8)","sum(VALUE._col9)","sum(VALUE._col10)","sum(VALUE._col11)"],keys:KEY._col0,
 KEY._col1, KEY._col2 | |                 <-Map 1 [SIMPLE_EDGE] vectorized   |
|                   SHUFFLE [RS_288]                 |
|                     PartitionCols:_col0, _col1, _col2 |
|                     Group By Operator [GBY_287] (rows=15835/13 width=466) |
|                       
Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11","_col12","_col13","_col14"],aggregations:["count(_col3)","sum(_col3)","sum(_col7)","sum(_col6)","count(_col4)","sum(_col4)","sum(_col9)","sum(_col8)","count(_col5)","sum(_col5)","sum(_col11)","sum(_col10)"],keys:_col0,
 _col1, _col2 |
|                       Select Operator [SEL_286] (rows=31670/13 width=370) |
|                         
Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11"]
 |
|                         Top N Key Operator [TNK_285] (rows=31670/13 
width=370) |
|                           keys:_col8, _col9, _col22,top n:100 |
|                           Map Join Operator [MAPJOIN_284] (rows=31670/13 
width=370) |
|                             
Conds:MAPJOIN_283._col2=RS_276._col0(Inner),Output:["_col4","_col8","_col9","_col12","_col18","_col22"]
 |
|                           <-Map 10 [BROADCAST_EDGE] vectorized |
|                             BROADCAST [RS_276]     |
|                               PartitionCols:_col0  |
|                               Select Operator [SEL_275] (rows=402/402 
width=94) |
|                                 Output:["_col0","_col1"] |
|                                 Filter Operator [FIL_274] (rows=402/402 
width=94) |
|                                   predicate:s_store_sk is not null |
|                                   TableScan [TS_32] (rows=402/402 width=94) |
|                                     
tpcds_bin_partitioned_orc_100@store,store, ACID 
table,Tbl:COMPLETE,Col:COMPLETE,Output:["s_store_sk","s_state"] |
|                           <-Map Join Operator [MAPJOIN_283] (rows=16071345/13 
width=301) |
|                               Conds:MAPJOIN_282._col1, _col0, 
_col3=RS_273._col6, _col5, 
_col7(Inner),Output:["_col2","_col4","_col8","_col9","_col12","_col18"] |
|                             <-Map 7 [BROADCAST_EDGE] vectorized |
|                               BROADCAST [RS_273]   |
|                                 PartitionCols:_col6, _col5, _col7 |
|                                 Map Join Operator [MAPJOIN_272] 
(rows=14713631/85 width=27) |
|                                   Conds:MAPJOIN_271._col0, 
_col1=RS_263._col1, 
_col0(Inner),Output:["_col2","_col5","_col6","_col7","_col8"] |
|                                 <-Map 8 [BROADCAST_EDGE] vectorized |
|                                   PARTITION_ONLY_SHUFFLE [RS_263] |
|                                     PartitionCols:_col1, _col0 |
|                                     Map Join Operator [MAPJOIN_262] 
(rows=3260639/4009059 width=17) |
|                                       
Conds:SEL_261._col4=RS_248._col0(Inner),Output:["_col0","_col1","_col2","_col3"]
 |
|                                     <-Map 4 [BROADCAST_EDGE] vectorized |
|                                       BROADCAST [RS_248] |
|                                         PartitionCols:_col0 |
|                                         Select Operator [SEL_243] 
(rows=274/274 width=8) |
|                                           Output:["_col0"] |
|                                           Filter Operator [FIL_241] 
(rows=274/274 width=98) |
|                                             predicate:((d_quarter_name) IN 
('2000Q1', '2000Q2', '2000Q3') and d_date_sk is not null) |
|                                             TableScan [TS_3] 
(rows=73049/10000 width=98) | |                                               
tpcds_bin_partitioned_orc_100@date_dim,d1, ACID 
table,Tbl:COMPLETE,Col:COMPLETE,Output:["d_date_sk","d_quarter_name"] |         
   
|                                     <-Select Operator [SEL_261] 
(rows=23824410/4009059 width=34) |
|                                         
Output:["_col0","_col1","_col2","_col3","_col4"] |
|                                         Filter Operator [FIL_260] 
(rows=23824410/4009059 width=34) |
|                                           predicate:(sr_customer_sk is not 
null and sr_item_sk is not null and sr_ticket_number is not null) |
|                                           TableScan [TS_15] 
(rows=27787616/4082869 width=34) |
|                                             
tpcds_bin_partitioned_orc_100@store_returns,store_returns, ACID 
table,Tbl:COMPLETE,Col:PARTIAL,Output:["sr_item_sk","sr_customer_sk","sr_ticket_number","sr_return_quantity"]
 |
|                                 <-Map Join Operator [MAPJOIN_271] 
(rows=19265846/4651602 width=13) |
|                                     
Conds:SEL_270._col3=RS_246._col0(Inner),Output:["_col0","_col1","_col2"] |
|                                   <-Map 4 [BROADCAST_EDGE] vectorized |
|                                     BROADCAST [RS_246] |
|                                       PartitionCols:_col0 |
|                                        Please refer to the previous Select 
Operator [SEL_243] |
|                                   <-Select Operator [SEL_270] 
(rows=129307930/4651602 width=27) |
|                                       
Output:["_col0","_col1","_col2","_col3"] |
|                                       Filter Operator [FIL_269] 
(rows=129307930/4651602 width=27) |
|                                         predicate:(cs_bill_customer_sk is not 
null and cs_item_sk is not null and cs_item_sk BETWEEN 
DynamicValue(RS_39_item_i_item_sk_min) AND 
DynamicValue(RS_39_item_i_item_sk_max) and in_bloom_filter(cs_item_sk, 
DynamicValue(RS_39_item_i_item_sk_bloom_filter))) |
|                                         TableScan [TS_9] 
(rows=143277484/15932778 width=27) |
|                                           
tpcds_bin_partitioned_orc_100@catalog_sales,catalog_sales, ACID 
table,Tbl:COMPLETE,Col:PARTIAL,Output:["cs_bill_customer_sk","cs_item_sk","cs_quantity"]
 |
|                                         <-Reducer 6 [BROADCAST_EDGE] 
vectorized | 
|                                           BROADCAST [RS_268] |
|                                             Group By Operator [GBY_267] 
(rows=1/1 width=24) |
|                                               
Output:["_col0","_col1","_col2"],aggregations:["min(VALUE._col0)","max(VALUE._col1)","bloom_filter(VALUE._col2,
 expectedEntries=1000000)"] |
|                                             <-Map 5 [CUSTOM_SIMPLE_EDGE] 
vectorized |
|                                               PARTITION_ONLY_SHUFFLE [RS_259] 
|
|                                                 Group By Operator [GBY_258] 
(rows=1/4 width=24) |
|                                                   
Output:["_col0","_col1","_col2"],aggregations:["min(_col0)","max(_col0)","bloom_filter(_col0,
 expectedEntries=1000000)"] |                       
|                                                   Select Operator [SEL_257] 
(rows=204000/204000 width=8) |
|                                                     Output:["_col0"] |
|                                                     Select Operator [SEL_255] 
(rows=204000/204000 width=292) |
|                                                       
Output:["_col0","_col1","_col2"] |
|                                                       Filter Operator 
[FIL_254] (rows=204000/204000 width=292) |
|                                                         predicate:i_item_sk 
is not null |
+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+ |                        
                                 TableScan [TS_6] (rows=204000/204000 
width=292) |
|                                                           
tpcds_bin_partitioned_orc_100@item,item, ACID 
table,Tbl:COMPLETE,Col:COMPLETE,Output:["i_item_sk","i_item_id","i_item_desc"] |
|                             <-Map Join Operator [MAPJOIN_282] 
(rows=830796/5975672 width=284) |
|                                 
Conds:MAPJOIN_281._col0=RS_256._col0(Inner),Output:["_col0","_col1","_col2","_col3","_col4","_col8","_col9"]
 |
|                               <-Map 5 [BROADCAST_EDGE] vectorized |
|                                 PARTITION_ONLY_SHUFFLE [RS_256] |
|                                   PartitionCols:_col0 |
|                                    Please refer to the previous Select 
Operator [SEL_255] |
|                               <-Map Join Operator [MAPJOIN_281] 
(rows=11164213/5975672 width=0) |
|                                   
Conds:SEL_280._col5=RS_244._col0(Inner),Output:["_col0","_col1","_col2","_col3","_col4"]
 |
|                                 <-Map 4 [BROADCAST_EDGE] vectorized |
|                                   BROADCAST [RS_244] |
|                                     PartitionCols:_col0 |
|                                     Select Operator [SEL_242] (rows=91/91 
width=8) |
|                                       Output:["_col0"] |
|                                       Filter Operator [FIL_240] (rows=91/91 
width=98) |
|                                         predicate:((d_quarter_name = 
'2000Q1') and d_date_sk is not null) |
|                                          Please refer to the previous 
TableScan [TS_3] |
|                                 <-Select Operator [SEL_280] 
(rows=224023431/5975672 width=42) |
|                                     
Output:["_col0","_col1","_col2","_col3","_col4","_col5"] |
|                                     Filter Operator [FIL_279] 
(rows=224023431/5975672 width=42) |
|                                       predicate:(ss_customer_sk is not null 
and ss_item_sk is not null and ss_ticket_number is not null and ss_store_sk is 
not null and ss_item_sk BETWEEN DynamicValue(RS[200]_col0) AND 
DynamicValue(RS[200]_col1) and ss_customer_sk BETWEEN 
DynamicValue(RS[200]_col2) AND DynamicValue(RS[200]_col3) and 
in_bloom_filter(hash(ss_item_sk,ss_customer_sk), DynamicValue(RS[200]_col4))) |
|                                       TableScan [TS_0] 
(rows=275041999/8422908 width=42) |
|                                         
tpcds_bin_partitioned_orc_100@store_sales,store_sales, ACID 
table,Tbl:COMPLETE,Col:PARTIAL,Output:["ss_item_sk","ss_customer_sk","ss_store_sk","ss_ticket_number","ss_quantity"]
 |
|                                       <-Reducer 9 [BROADCAST_EDGE] vectorized 
|
|                                         BROADCAST [RS_278] |
|                                           Group By Operator [GBY_277] 
(rows=1/1 width=176) |
|                                             
Output:["_col0","_col1","_col2","_col3","_col4"],aggregations:["min(VALUE._col0)","max(VALUE._col1)","min(VALUE._col2)","max(VALUE._col3)","bloom_filter(VALUE._col4,
 expectedEntries=1000000)"] |
|                                                    |
+----------------------------------------------------+
 {code}
 

 

> TPC-DS query 17 returns wrong results
> -------------------------------------
>
>                 Key: HIVE-26655
>                 URL: https://issues.apache.org/jira/browse/HIVE-26655
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Sungwoo Park
>            Priority: Major
>
> When tested with 100GB ORC tables, the number of rows returned by query 17 is 
> not stable. It returns fewer rows than the correct result (55 rows).
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to