Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16474 )

Change subject: IMPALA-10178 Run-time profile shall report skews
......................................................................


Patch Set 15:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/16474/15//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16474/15//COMMIT_MSG@22
PS15, Line 22:   3. RowsReturned in GroupingAggregator profile
It would be good to add this for sort operations too. We have SortDataSize 
already, which would be an OK metric, or I guess we could add a count of the 
rows in the sorter. Could be a follow-on patch but might be good to include 
here.

It would also be good to include RowsReturned for exchanges, since that could 
be another source of skew.


http://gerrit.cloudera.org:8080/#/c/16474/15//COMMIT_MSG@36
PS15, Line 36:         skew(s) found at: HASH_JOIN_NODE (id=4), HDFS_SCAN_NODE 
(id=0)
I thought a bit about whether using the info string was the right approach (as 
opposed to adding it to the thrift in a more structure way) and I think this 
makes sense, since all the tools can already handle info strings.


http://gerrit.cloudera.org:8080/#/c/16474/15/be/src/util/runtime-profile-counters.h
File be/src/util/runtime-profile-counters.h:

http://gerrit.cloudera.org:8080/#/c/16474/15/be/src/util/runtime-profile-counters.h@415
PS15, Line 415: o.
can you explicitly say that it's returned in 'details'.

When does it return true/false?


http://gerrit.cloudera.org:8080/#/c/16474/15/be/src/util/runtime-profile-counters.h@416
PS15, Line 416:
nit: convention is to use pointer for output args.


http://gerrit.cloudera.org:8080/#/c/16474/15/be/src/util/runtime-profile.cc
File be/src/util/runtime-profile.cc:

http://gerrit.cloudera.org:8080/#/c/16474/15/be/src/util/runtime-profile.cc@1969
PS15, Line 1969:   int num_valid_values = NumValidValues();
I don't think this quite works since valid values could be added concurrently 
with this method executes. I think you instead need to do a single pass over 
the array prepend the ", " if it's not the first.


http://gerrit.cloudera.org:8080/#/c/16474/15/be/src/util/stat-util.h
File be/src/util/stat-util.h:

http://gerrit.cloudera.org:8080/#/c/16474/15/be/src/util/stat-util.h@28
PS15, Line 28:   /// Computes standard deviation given mean
Is this the population standard deviation or the sample standard deviation? 
Would be good to document in the comment cause it's caused confusion in the 
past when it's ambiguous.

I don't know which is the right one to use in this context and it probably 
doesn't matter for the skew threshold. So ok to punt on that.



--
To view, visit http://gerrit.cloudera.org:8080/16474
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I91041f2856eef8293ea78f1721f97469062589a1
Gerrit-Change-Number: 16474
Gerrit-PatchSet: 15
Gerrit-Owner: Qifan Chen <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Reviewer: Sahil Takiar <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Comment-Date: Wed, 23 Sep 2020 03:40:36 +0000
Gerrit-HasComments: Yes

Reply via email to