Hi Rajesh,

Thanks for spending time on this.
We tried applying these patches but unfortunately it didn't help



HIVE-23485: Bound GroupByOperator stats using largest NDV among columns

HIVE-23684: Large underestimation in NDV stats when input and join cardinality 
ratio is big

HIVE-20432: Rewrite BETWEEN to IN for integer types for stats estimation

HIVE-20537 : Multi-column joins estimates with uncorrelated columns different 
in CBO and Hive

HIVE-20636: Improve number of null values estimation after outer join

HIVE-20366: TPC-DS query78 stats estimates are off for is null filter

HIVE-18140: Partitioned tables statistics can go wrong in basic stats mixed case

HIVE-20660: added patch part 2

HIVE-20660: added patch part 1

There was one observation from our side which we wanted to share and get your 
opinion on that. One of the reduce sink operator RS[11] which got less reducer 
assigned actually got the data size from it parents. The chain of edges looks 
something like this as per log:

TS[0]-FIL[73]-SEL[2]-RS[6]-JOIN[8]-SEL[9]-GBY[10]-RS[11]-GBY[12]-SEL[13]-RS[39]-JOIN[41]-FIL[42]-RS[43]-JOIN[45]-RS[46]-JOIN[48]-SEL[49]-RS[50]-SEL[51]-LIM[52]-FS[53]

TS[3]-FIL[74]-SEL[5]-RS[7]-JOIN[8]

Here two stream of data from TS[0] and TS[3] get passed all the way till RS[11] 
and beyond. What we have observed that the CountDistinct starts getting 
mismatches starting from filter operators FIL[73], FIL[74], however row count 
mismatch(in stats) starts mismatching from join operator JOIN[8] itself and the 
discrepancies gets carried forward all the way till RS[11]

From 3.1.2 log

2022-10-02T07:41:52,421 DEBUG [c7021e6c-6ebc-4b44-9349-79e834cf0ca3 
HiveServer2-Handler-Pool: Thread-170]: annotation.StatsRulesProcFactory (:()) - 
[0] STATS-JOIN[8]:  numRows: 149188770 dataSize: 15299485696 basicStatsState: 
COMPLETE colStatsState: COMPLETE colStats: {_col0= colName: _col0 colType: 
bigint countDistincts: 1610 numNulls: 0 avgColLen: 8.0 numTrues: 0 numFalses: 0 
Range: [ min: 1 max: 28000 ] isPrimaryKey: false isEstimated: false, _col2= 
colName: _col2 colType: decimal(7,2) countDistincts: 2068 numNulls: 31639899 
avgColLen: 112.0 numTrues: 0 numFalses: 0 Range: [ min: 0 max: 200 ] 
isPrimaryKey: false isEstimated: false, _col1= colName: _col1 colType: bigint 
countDistincts: 4 numNulls: 31626037 avgColLen: 8.0 numTrues: 0 numFalses: 0 
Range: [ min: 1 max: 68 ] isPrimaryKey: false isEstimated: false}

From trunk

2022-09-28T19:59:45,402 DEBUG [7c0dcbda-a89e-45cb-94f9-a85ccda010e5 
HiveServer2-Handler-Pool: Thread-120]: annotation.StatsRulesProcFactory (:()) - 
[0] STATS-JOIN[8]:  numRows: 263995240 dataSize: 29994713856 basicStatsState: 
COMPLETE colStatsState: COMPLETE colStats: {_col0= colName: _col0 colType: 
bigint countDistincts: 2848 numNulls: 0 avgColLen: 8.0 numTrues: 0 numFalses: 0 
Range: [ min: 1 max: 28000 ] isPrimaryKey: false isEstimated: false, _col2= 
colName: _col2 colType: decimal(7,2) countDistincts: 3748 numNulls: 31639899 
avgColLen: 112.0 numTrues: 0 numFalses: 0 Range: [ min: 0 max: 200 ] 
isPrimaryKey: false isEstimated: false, _col1= colName: _col1 colType: bigint 
countDistincts: 8 numNulls: 31626037 avgColLen: 8.0 numTrues: 0 numFalses: 0 
Range: [ min: 1 max: 68 ] isPrimaryKey: false isEstimated: false}

Does countDistincts play a role during join operation? I am also attaching the 
HS2 log from trunk code for better comparison.

Regards,
Brahma

From: Rajesh Balamohan <rbalamo...@apache.org>
Reply-To: "u...@hive.apache.org" <u...@hive.apache.org>
Date: Monday, 3 October 2022 at 5:35 AM
To: "u...@hive.apache.org" <u...@hive.apache.org>
Cc: "dev@hive.apache.org" <dev@hive.apache.org>
Subject: Re: TPCDS query degrade with hive-3.1.2 because of wrong estimation 
for reducers

Based on the plan, filtered output in map-1 had mis-estimates and also groupby 
operators have large misestimates.

This is causing the number of reducers to be estimated as "4" which is less for 
this query. Due to the partition factor of tez, it ends up with 8 reducer slots 
at runtime for hive 3.x.

Here are a few tickets which can help, but note that it is very risky to 
backport pointed patches for stats and CBO without complete context. This may 
have an adverse impact on other queries.

https://issues.apache.org/jira/browse/HIVE-23684<https://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FHIVE-23684&data=05%7C01%7Cbbattula%40visa.com%7C1978a7adfb6a494768ba08daa4d3019f%7C38305e12e15d4ee888b9c4db1c477d76%7C0%7C0%7C638003523457516385%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=Wid5oP8esFNYApmTYQ0AtVVsXZ1vyaLRMKjEZk8VLGo%3D&reserved=0>
https://issues.apache.org/jira/browse/HIVE-20432<https://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FHIVE-20432&data=05%7C01%7Cbbattula%40visa.com%7C1978a7adfb6a494768ba08daa4d3019f%7C38305e12e15d4ee888b9c4db1c477d76%7C0%7C0%7C638003523457672652%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=AZcLqHcduwJCNoSrPq3XULhuKyX1EueVrso7zNFFuWY%3D&reserved=0>
https://issues.apache.org/jira/browse/HIVE-23485<https://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FHIVE-23485&data=05%7C01%7Cbbattula%40visa.com%7C1978a7adfb6a494768ba08daa4d3019f%7C38305e12e15d4ee888b9c4db1c477d76%7C0%7C0%7C638003523457672652%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=fEw7VqFhGZM52jdUN%2B4ZZf2bGdloWro2XLdPxwLr2i0%3D&reserved=0>

On Sun, Oct 2, 2022 at 1:56 PM Battula, Brahma Reddy 
<bbatt...@visa.com<mailto:bbatt...@visa.com>> wrote:
+ Attaching the hs2 logs also.

From: "Battula, Brahma Reddy" <bbatt...@visa.com<mailto:bbatt...@visa.com>>
Date: Sunday, 2 October 2022 at 2:16 AM
To: "u...@hive.apache.org<mailto:u...@hive.apache.org>" 
<u...@hive.apache.org<mailto:u...@hive.apache.org>>
Subject: TPCDS query degrade with hive-3.1.2 because of wrong estimation for 
reducers

Hi All,

We’ve ran TPCDS queries against hive-3.1.2 and trunk(little older version). 
(Attached files suffix “a” is trunk and “v” is 3.1.2)

The query execution time is higher in hive-3.1.2 as number of the reducers 
estimated is less (8) as compared to trunk version where it’s 46.

All the hive/tez/Yarn configs are same in both clusters. Even h/w resources are 
same. And query planner is also same.

The stats in reduce sink phase are not look same.

HIVE_TRUNK_CODE - 2022-09-26T05:58:23,786 INFO  
[07243354-f941-419d-8908-45009762e67d HiveServer2-Handler-Pool: Thread-168]: 
optimizer.ConvertJoinMapJoin (:()) - Join input#1; onlineDataSize:   9628; 
Statistics: Num rows:  359 Data size: 4308  Basic stats: COMPLETE Column stats: 
COMPLETE
HIVE_3.1.2_CODE - 2022-09-27T03:39:45,116 INFO  
[2fd1493c-f1a0-4874-acac-58f28e9c21ea HiveServer2-Handler-Pool: Thread-134]: 
optimizer.ConvertJoinMapJoin (:()) - Join input#1; onlineDataSize: 325856; 
Statistics: Num rows: 8116 Data size: 97392 Basic stats: COMPLETE Column stats: 
COMPLETE

Any idea how the reducers getting underestimated.?




Reply via email to