Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/16474 )
Change subject: IMPALA-10178 Run-time profile shall report skews
......................................................................

Patch Set 22:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16474/21/be/src/util/runtime-profile.cc
File be/src/util/runtime-profile.cc:

http://gerrit.cloudera.org:8080/#/c/16474/21/be/src/util/runtime-profile.cc@1928
PS21, Line 1928: vector<T> values;
> > In the past, stddev with a threshold of 5 served the purpose well.

Here are some examples from a profile (QueryIDa841807823cbdc837375aa6000000000.txt) involving tables on the order of millions of rows and a DoP of 212.

Hash join 08, left child (145.00M rows):
  fragment instances = 212
  stddev = 11421.5
  mean = 683975
  stddev/mean = 0.0166

HDFS scan 18:
  fragment instances = 209
  stddev = 918947
  mean = 4.45696e+06
  stddev/mean = 0.206

Hash exchange 38:
  fragment instances = 209
  stddev = 13692.9
  mean = 1.52542e+06
  stddev/mean = 0.0089

Here stddev/mean is the coefficient of variation (CV), also known as the relative standard deviation (RSD). It shows the extent of variability relative to the mean of the population. In our case, the lower the CV, the better. When all values are the same and >= 1, CV is 0 because stddev is 0.

Of the three examples above, the HDFS scan at node 18 has a CV of about 20%. That is a skew case in my opinion. The skew in the other two is much smaller.

The intention of reporting skew is to reveal processing imbalance. Translating the skew into performance loss has to be done separately. In the case of filtering, my theory is that if the matching values are distributed evenly and the scanners are applied evenly, the rows read by each scanner should be about the same.

To report only severe skews for Impala, maybe we can use CV (instead of stddev) as the threshold, say CV > 5% && mean > 1 million.
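A minimal sketch of the proposed check, to make the CV threshold concrete. This is not the code in the patch: `SkewStats`, `ComputeSkewStats`, and `IsSevereSkew` are hypothetical names, the population (not sample) standard deviation is an assumption, and the default thresholds are only the 5% and 1 million figures suggested above.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical helper types and functions for illustration only;
// they are not part of be/src/util/runtime-profile.cc.
struct SkewStats {
  double mean = 0.0;
  double stddev = 0.0;  // population standard deviation (assumption)
  double cv = 0.0;      // stddev / mean; left at 0 when mean is 0
};

// Computes mean, stddev and CV over per-fragment-instance counter values.
inline SkewStats ComputeSkewStats(const std::vector<int64_t>& values) {
  SkewStats s;
  if (values.empty()) return s;
  double sum = 0.0;
  for (int64_t v : values) sum += static_cast<double>(v);
  s.mean = sum / static_cast<double>(values.size());
  double sq_diff = 0.0;
  for (int64_t v : values) {
    const double d = static_cast<double>(v) - s.mean;
    sq_diff += d * d;
  }
  s.stddev = std::sqrt(sq_diff / static_cast<double>(values.size()));
  if (s.mean != 0.0) s.cv = s.stddev / s.mean;
  return s;
}

// Reports skew only when both conditions hold: CV above the threshold and
// a mean large enough for the imbalance to matter.
inline bool IsSevereSkew(const SkewStats& s, double cv_threshold = 0.05,
                         double min_mean = 1e6) {
  return s.cv > cv_threshold && s.mean > min_mean;
}
```

Applied to the figures quoted above, this check would flag HDFS scan 18 (mean about 4.46 million, CV about 0.206) and would not flag hash join 08 (mean below 1 million) or hash exchange 38 (CV below 0.01).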
--
To view, visit http://gerrit.cloudera.org:8080/16474
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I91041f2856eef8293ea78f1721f97469062589a1
Gerrit-Change-Number: 16474
Gerrit-PatchSet: 22
Gerrit-Owner: Qifan Chen <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Reviewer: Sahil Takiar <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Comment-Date: Fri, 25 Sep 2020 17:36:28 +0000
Gerrit-HasComments: Yes
