[
https://issues.apache.org/jira/browse/HIVE-26655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17622800#comment-17622800
]
Sungwoo Park commented on HIVE-26655:
-------------------------------------
I tested again with the latest release in the master branch (Hive on Tez),
using 100GB TPC-DS benchmark.
1. Before computing statistics with ANALYZE TABLE
With hive.tez.bloom.filter.merge.threads=0, query 17 returns correct results
(55 rows) consistently.
With hive.tez.bloom.filter.merge.threads=1, query 17 returns wrong results
inconsistently (7 rows, 18 rows, 17 rows, and so on).
2. After computing statistics with ANALYZE TABLE
Query 17 returns correct results (55 rows) whether
hive.tez.bloom.filter.merge.threads is set to 0 or 1.
Here is an example of running EXPLAIN ANALYZE.
{code:java}
0: jdbc:hive2://blue0:9852/> set hive.tez.bloom.filter.merge.threads=1;
No rows affected (0.005 seconds)
0: jdbc:hive2://blue0:9852/> EXPLAIN ANALYZE select i_item_id ,i_item_desc
,s_state ,count(ss_quantity) as store_sales_quantitycount ,avg(ss_quantity) as
store_sales_quantityave ,stddev_samp(ss_quantity) as store_sales_quantitystdev
,stddev_samp(ss_quantity)/avg(ss_quantity) as store_sales_quantitycov
,count(sr_return_quantity) as_store_returns_quantitycount
,avg(sr_return_quantity) as_store_returns_quantityave
,stddev_samp(sr_return_quantity) as_store_returns_quantitystdev
,stddev_samp(sr_return_quantity)/avg(sr_return_quantity) as
store_returns_quantitycov ,count(cs_quantity) as catalog_sales_quantitycount
,avg(cs_quantity) as catalog_sales_quantityave
,stddev_samp(cs_quantity)/avg(cs_quantity) as catalog_sales_quantitystdev
,stddev_samp(cs_quantity)/avg(cs_quantity) as catalog_sales_quantitycov from
store_sales ,store_returns ,catalog_sales ,date_dim d1 ,date_dim d2 ,date_dim
d3 ,store ,item where d1.d_quarter_name = '2000Q1' and d1.d_date_sk =
ss_sold_date_sk and i_item_sk = ss_item_sk and s_store_sk = ss_store_sk and
ss_customer_sk = sr_customer_sk and ss_item_sk = sr_item_sk and
ss_ticket_number = sr_ticket_number and sr_returned_date_sk = d2.d_date_sk and
d2.d_quarter_name in ('2000Q1','2000Q2','2000Q3') and sr_customer_sk =
cs_bill_customer_sk and sr_item_sk = cs_item_sk and cs_sold_date_sk =
d3.d_date_sk and d3.d_quarter_name in ('2000Q1','2000Q2','2000Q3') group by
i_item_id ,i_item_desc ,s_state order by i_item_id ,i_item_desc ,s_state limit
100;
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| Plan optimized by CBO. |
| |
| Vertex dependency in root stage |
| Map 1 <- Map 10 (BROADCAST_EDGE), Map 4 (BROADCAST_EDGE), Map 5
(BROADCAST_EDGE), Map 7 (BROADCAST_EDGE), Reducer 9 (BROADCAST_EDGE) |
| Map 7 <- Map 4 (BROADCAST_EDGE), Map 8 (BROADCAST_EDGE), Reducer 6
(BROADCAST_EDGE) |
| Map 8 <- Map 4 (BROADCAST_EDGE) |
| Reducer 2 <- Map 1 (SIMPLE_EDGE) |
| Reducer 3 <- Reducer 2 (SIMPLE_EDGE) |
| Reducer 6 <- Map 5 (CUSTOM_SIMPLE_EDGE) |
| Reducer 9 <- Map 8 (CUSTOM_SIMPLE_EDGE) |
| |
| Stage-0 |
| Fetch Operator |
| limit:100 |
| Stage-1 |
| Reducer 3 vectorized |
| File Output Operator [FS_294] |
| Limit [LIM_293] (rows=100/13 width=466) |
| Number of rows:100 |
| Select Operator [SEL_292] (rows=15835/13 width=466) |
|
Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11","_col12","_col13","_col14"]
|
| <-Reducer 2 [SIMPLE_EDGE] vectorized |
| SHUFFLE [RS_291] |
| Select Operator [SEL_290] (rows=15835/13 width=458) |
|
Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11","_col12","_col13"]
|
| Group By Operator [GBY_289] (rows=15835/13 width=466) |
|
Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11","_col12","_col13","_col14"],aggregations:["count(VALUE._col0)","sum(VALUE._col1)","sum(VALUE._col2)","sum(VALUE._col3)","count(VALUE._col4)","sum(VALUE._col5)","sum(VALUE._col6)","sum(VALUE._col7)","count(VALUE._col8)","sum(VALUE._col9)","sum(VALUE._col10)","sum(VALUE._col11)"],keys:KEY._col0,
KEY._col1, KEY._col2 | | <-Map 1 [SIMPLE_EDGE] vectorized |
| SHUFFLE [RS_288] |
| PartitionCols:_col0, _col1, _col2 |
| Group By Operator [GBY_287] (rows=15835/13 width=466) |
|
Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11","_col12","_col13","_col14"],aggregations:["count(_col3)","sum(_col3)","sum(_col7)","sum(_col6)","count(_col4)","sum(_col4)","sum(_col9)","sum(_col8)","count(_col5)","sum(_col5)","sum(_col11)","sum(_col10)"],keys:_col0,
_col1, _col2 |
| Select Operator [SEL_286] (rows=31670/13 width=370) |
|
Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11"]
|
| Top N Key Operator [TNK_285] (rows=31670/13
width=370) |
| keys:_col8, _col9, _col22,top n:100 |
| Map Join Operator [MAPJOIN_284] (rows=31670/13
width=370) |
|
Conds:MAPJOIN_283._col2=RS_276._col0(Inner),Output:["_col4","_col8","_col9","_col12","_col18","_col22"]
|
| <-Map 10 [BROADCAST_EDGE] vectorized |
| BROADCAST [RS_276] |
| PartitionCols:_col0 |
| Select Operator [SEL_275] (rows=402/402
width=94) |
| Output:["_col0","_col1"] |
| Filter Operator [FIL_274] (rows=402/402
width=94) |
| predicate:s_store_sk is not null |
| TableScan [TS_32] (rows=402/402 width=94) |
|
tpcds_bin_partitioned_orc_100@store,store, ACID
table,Tbl:COMPLETE,Col:COMPLETE,Output:["s_store_sk","s_state"] |
| <-Map Join Operator [MAPJOIN_283] (rows=16071345/13
width=301) |
| Conds:MAPJOIN_282._col1, _col0,
_col3=RS_273._col6, _col5,
_col7(Inner),Output:["_col2","_col4","_col8","_col9","_col12","_col18"] |
| <-Map 7 [BROADCAST_EDGE] vectorized |
| BROADCAST [RS_273] |
| PartitionCols:_col6, _col5, _col7 |
| Map Join Operator [MAPJOIN_272]
(rows=14713631/85 width=27) |
| Conds:MAPJOIN_271._col0,
_col1=RS_263._col1,
_col0(Inner),Output:["_col2","_col5","_col6","_col7","_col8"] |
| <-Map 8 [BROADCAST_EDGE] vectorized |
| PARTITION_ONLY_SHUFFLE [RS_263] |
| PartitionCols:_col1, _col0 |
| Map Join Operator [MAPJOIN_262]
(rows=3260639/4009059 width=17) |
|
Conds:SEL_261._col4=RS_248._col0(Inner),Output:["_col0","_col1","_col2","_col3"]
|
| <-Map 4 [BROADCAST_EDGE] vectorized |
| BROADCAST [RS_248] |
| PartitionCols:_col0 |
| Select Operator [SEL_243]
(rows=274/274 width=8) |
| Output:["_col0"] |
| Filter Operator [FIL_241]
(rows=274/274 width=98) |
| predicate:((d_quarter_name) IN
('2000Q1', '2000Q2', '2000Q3') and d_date_sk is not null) |
| TableScan [TS_3]
(rows=73049/10000 width=98) | |
tpcds_bin_partitioned_orc_100@date_dim,d1, ACID
table,Tbl:COMPLETE,Col:COMPLETE,Output:["d_date_sk","d_quarter_name"] |
| <-Select Operator [SEL_261]
(rows=23824410/4009059 width=34) |
|
Output:["_col0","_col1","_col2","_col3","_col4"] |
| Filter Operator [FIL_260]
(rows=23824410/4009059 width=34) |
| predicate:(sr_customer_sk is not
null and sr_item_sk is not null and sr_ticket_number is not null) |
| TableScan [TS_15]
(rows=27787616/4082869 width=34) |
|
tpcds_bin_partitioned_orc_100@store_returns,store_returns, ACID
table,Tbl:COMPLETE,Col:PARTIAL,Output:["sr_item_sk","sr_customer_sk","sr_ticket_number","sr_return_quantity"]
|
| <-Map Join Operator [MAPJOIN_271]
(rows=19265846/4651602 width=13) |
|
Conds:SEL_270._col3=RS_246._col0(Inner),Output:["_col0","_col1","_col2"] |
| <-Map 4 [BROADCAST_EDGE] vectorized |
| BROADCAST [RS_246] |
| PartitionCols:_col0 |
| Please refer to the previous Select
Operator [SEL_243] |
| <-Select Operator [SEL_270]
(rows=129307930/4651602 width=27) |
|
Output:["_col0","_col1","_col2","_col3"] |
| Filter Operator [FIL_269]
(rows=129307930/4651602 width=27) |
| predicate:(cs_bill_customer_sk is not
null and cs_item_sk is not null and cs_item_sk BETWEEN
DynamicValue(RS_39_item_i_item_sk_min) AND
DynamicValue(RS_39_item_i_item_sk_max) and in_bloom_filter(cs_item_sk,
DynamicValue(RS_39_item_i_item_sk_bloom_filter))) |
| TableScan [TS_9]
(rows=143277484/15932778 width=27) |
|
tpcds_bin_partitioned_orc_100@catalog_sales,catalog_sales, ACID
table,Tbl:COMPLETE,Col:PARTIAL,Output:["cs_bill_customer_sk","cs_item_sk","cs_quantity"]
|
| <-Reducer 6 [BROADCAST_EDGE]
vectorized |
| BROADCAST [RS_268] |
| Group By Operator [GBY_267]
(rows=1/1 width=24) |
|
Output:["_col0","_col1","_col2"],aggregations:["min(VALUE._col0)","max(VALUE._col1)","bloom_filter(VALUE._col2,
expectedEntries=1000000)"] |
| <-Map 5 [CUSTOM_SIMPLE_EDGE]
vectorized |
| PARTITION_ONLY_SHUFFLE [RS_259]
|
| Group By Operator [GBY_258]
(rows=1/4 width=24) |
|
Output:["_col0","_col1","_col2"],aggregations:["min(_col0)","max(_col0)","bloom_filter(_col0,
expectedEntries=1000000)"] |
| Select Operator [SEL_257]
(rows=204000/204000 width=8) |
| Output:["_col0"] |
| Select Operator [SEL_255]
(rows=204000/204000 width=292) |
|
Output:["_col0","_col1","_col2"] |
| Filter Operator
[FIL_254] (rows=204000/204000 width=292) |
| predicate:i_item_sk
is not null |
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+ |
TableScan [TS_6] (rows=204000/204000
width=292) |
|
tpcds_bin_partitioned_orc_100@item,item, ACID
table,Tbl:COMPLETE,Col:COMPLETE,Output:["i_item_sk","i_item_id","i_item_desc"] |
| <-Map Join Operator [MAPJOIN_282]
(rows=830796/5975672 width=284) |
|
Conds:MAPJOIN_281._col0=RS_256._col0(Inner),Output:["_col0","_col1","_col2","_col3","_col4","_col8","_col9"]
|
| <-Map 5 [BROADCAST_EDGE] vectorized |
| PARTITION_ONLY_SHUFFLE [RS_256] |
| PartitionCols:_col0 |
| Please refer to the previous Select
Operator [SEL_255] |
| <-Map Join Operator [MAPJOIN_281]
(rows=11164213/5975672 width=0) |
|
Conds:SEL_280._col5=RS_244._col0(Inner),Output:["_col0","_col1","_col2","_col3","_col4"]
|
| <-Map 4 [BROADCAST_EDGE] vectorized |
| BROADCAST [RS_244] |
| PartitionCols:_col0 |
| Select Operator [SEL_242] (rows=91/91
width=8) |
| Output:["_col0"] |
| Filter Operator [FIL_240] (rows=91/91
width=98) |
| predicate:((d_quarter_name =
'2000Q1') and d_date_sk is not null) |
| Please refer to the previous
TableScan [TS_3] |
| <-Select Operator [SEL_280]
(rows=224023431/5975672 width=42) |
|
Output:["_col0","_col1","_col2","_col3","_col4","_col5"] |
| Filter Operator [FIL_279]
(rows=224023431/5975672 width=42) |
| predicate:(ss_customer_sk is not null
and ss_item_sk is not null and ss_ticket_number is not null and ss_store_sk is
not null and ss_item_sk BETWEEN DynamicValue(RS[200]_col0) AND
DynamicValue(RS[200]_col1) and ss_customer_sk BETWEEN
DynamicValue(RS[200]_col2) AND DynamicValue(RS[200]_col3) and
in_bloom_filter(hash(ss_item_sk,ss_customer_sk), DynamicValue(RS[200]_col4))) |
| TableScan [TS_0]
(rows=275041999/8422908 width=42) |
|
tpcds_bin_partitioned_orc_100@store_sales,store_sales, ACID
table,Tbl:COMPLETE,Col:PARTIAL,Output:["ss_item_sk","ss_customer_sk","ss_store_sk","ss_ticket_number","ss_quantity"]
|
| <-Reducer 9 [BROADCAST_EDGE] vectorized
|
| BROADCAST [RS_278] |
| Group By Operator [GBY_277]
(rows=1/1 width=176) |
|
Output:["_col0","_col1","_col2","_col3","_col4"],aggregations:["min(VALUE._col0)","max(VALUE._col1)","min(VALUE._col2)","max(VALUE._col3)","bloom_filter(VALUE._col4,
expectedEntries=1000000)"] |
| |
+----------------------------------------------------+
{code}
> TPC-DS query 17 returns wrong results
> -------------------------------------
>
> Key: HIVE-26655
> URL: https://issues.apache.org/jira/browse/HIVE-26655
> Project: Hive
> Issue Type: Sub-task
> Reporter: Sungwoo Park
> Priority: Major
>
> When tested with 100GB ORC tables, the number of rows returned by query 17 is
> not stable. It returns fewer rows than the correct result (55 rows).
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)