Qifan Chen has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16474 )

Change subject: IMPALA-10178 Run-time profile shall report skews
......................................................................


Patch Set 22:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16474/21/be/src/util/runtime-profile.cc
File be/src/util/runtime-profile.cc:

http://gerrit.cloudera.org:8080/#/c/16474/21/be/src/util/runtime-profile.cc@1928
PS21, Line 1928:   vector<T> values;
> > In the past, stddev with a threshold of 5 served the purpose well.
Here are some examples from a profile 
QueryIDa841807823cbdc837375aa6000000000.txt involving tables on the order of 
millions of rows and a degree of parallelism (DoP) of 212.

Hash join 08
    left child: 145.00M rows
    fragment instances: 212
    stddev: 11421.5
    mean: 683975
    stddev/mean: 0.0166

Hdfs_scan 18
    fragment instances: 209
    stddev: 918947
    mean: 4.45696e+06
    stddev/mean: 0.206

Hash exchange 38
    fragment instances: 209
    stddev: 13692.9
    mean: 1.52542e+06
    stddev/mean: 0.0089

Here stddev/mean is called the coefficient of variation (CV), also known as 
the relative standard deviation (RSD). It shows the extent of variability in 
relation to the mean of the population. In our case, the lower the CV, the 
better. When all values are the same and >= 1, CV is 0 because stddev is 0.
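
As a rough sketch of the calculation (not code from the patch), CV could be 
computed from the per-instance values as below. The function name 
CoefficientOfVariation is hypothetical, and the sketch assumes the population 
(not sample) standard deviation and non-negative counter values.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper, not part of runtime-profile.cc: returns the coefficient
// of variation (population stddev / mean) of per-instance counter values.
template <typename T>
double CoefficientOfVariation(const std::vector<T>& values) {
  assert(!values.empty());  // Assumption: at least one fragment instance.
  double sum = 0.0;
  for (const T& v : values) sum += static_cast<double>(v);
  const double mean = sum / values.size();
  // Assumption: counters are non-negative, so a zero mean means every value
  // is zero; report no skew rather than dividing by zero.
  if (mean == 0.0) return 0.0;
  double sq_diff = 0.0;
  for (const T& v : values) {
    const double d = static_cast<double>(v) - mean;
    sq_diff += d * d;
  }
  const double stddev = std::sqrt(sq_diff / values.size());
  return stddev / mean;
}
```

With identical values the squared differences are all 0, so the result is 0, 
matching the statement above.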

Looking at the three examples above, the HDFS scan at node 18 has a CV of 
about 20%. That is a skew case in my opinion. The skew in the other two is 
much smaller (CVs under 2%).

The intention of reporting skew is to reveal processing imbalance. Translating 
the skewness into performance loss has to be done separately. In the case of 
filtering, my theory is that if the matching values are distributed evenly and 
the scanners are applied evenly, the number of rows read should be about the 
same across instances.

To report only severe skews in Impala, maybe we can use CV (instead of stddev) 
as the threshold. Say
cv > 5% && mean over 1 million.
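
A minimal sketch of that rule, to make the proposal concrete. The name 
IsSevereSkew and its parameters are hypothetical and not part of the patch; the 
defaults are the 5% and 1 million values suggested above.

```cpp
#include <cassert>

// Hypothetical predicate: reports skew only when the mean exceeds min_mean
// and the coefficient of variation (stddev / mean) exceeds cv_threshold.
bool IsSevereSkew(double stddev, double mean,
                  double cv_threshold = 0.05, double min_mean = 1e6) {
  assert(stddev >= 0.0);  // A standard deviation is never negative.
  // Checking the mean first also avoids dividing by zero.
  if (mean <= min_mean) return false;
  return stddev / mean > cv_threshold;
}
```

Applied to the three examples, only Hdfs_scan 18 (mean 4.45696e+06, CV 0.206) 
would be reported. Hash join 08 falls below the mean cutoff, and hash exchange 
38 has a CV under 5%.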



--
To view, visit http://gerrit.cloudera.org:8080/16474
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I91041f2856eef8293ea78f1721f97469062589a1
Gerrit-Change-Number: 16474
Gerrit-PatchSet: 22
Gerrit-Owner: Qifan Chen <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Reviewer: Sahil Takiar <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Comment-Date: Fri, 25 Sep 2020 17:36:28 +0000
Gerrit-HasComments: Yes
