[Impala-ASF-CR] IMPALA-7550: Add documentation to profile counters

Balazs Jeszenszky (Code Review) Thu, 05 Dec 2019 06:31:35 -0800

Balazs Jeszenszky has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14776 )


Change subject: IMPALA-7550: Add documentation to profile counters
......................................................................


Patch Set 5:

(28 comments)

Looks nice. I haven't looked at the significance fields for individual counters 
yet. In general, all but DEBUG descriptions should avoid Impala-internal 
terminology as much as possible. The target audience will never want to open 
Impala's code, so referencing function names, etc. will be frustrating. Some 
are harder to word around than others (e.g. scan range), I skipped those for 
now.

http://gerrit.cloudera.org:8080/#/c/14776/5//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/14776/5//COMMIT_MSG@37
PS5, Line 37: 2. Profile counters are annotated with their stability:
            : * Stable counters - generally useful to understand query 
performance,
            : should only change rarely and if it does we'll make some effort to
            : notify users. E.g. BytesRead.
            : * Unstable but useful - useful to understand query performance, 
but
            : subject to change, particularly if the implementation changes. 
E.g.
            : RowBatchQueuePutWaitTime, MaterializeTupleTimer
            : * Debugging counters - generally not useful to users of Impala, 
the main
            : use case is low-level debugging. Can be hidden to reduce noise 
for most
            : consumers of profiles.
            :
            : 3. Profile counters are also annotated with their significance to 
users.
            : * Critical level counters - always useful on measuring query 
performance and status.
            : Counters that everyone are interested.
            : * High level counters - generally interesting counters. Most of 
the users will be
            : interested and all the developers are very interested.
            : * Medium level counters - somehow interesting counters to 
monitor. It will probably be
            : interesting under some circumstance. Lot of developers are 
interested.
            : * Low level counters - not interesting to users. Should be useful 
for developers
            : to debug only.
Please simplify this to what's in the code - I think the descriptions in 
Significance are a good explanation of the different levels.


http://gerrit.cloudera.org:8080/#/c/14776/5/be/src/exec/hdfs-scan-node-base.cc
File be/src/exec/hdfs-scan-node-base.cc:

http://gerrit.cloudera.org:8080/#/c/14776/5/be/src/exec/hdfs-scan-node-base.cc@70
PS5, Line 70: by Disk I/O threads in HDFS read operations. For example, if we 
have 3 reading "
            :     "threads and each spent 1 sec, this counter will report 3 
sec.")
Instead of including an explanation in individual description (but not in e.g. 
the next one), I'd stick to a common language that's clear enough by itself for 
all 'wall clock timers over multiple threads' descriptions. Maybe 'Aggregate 
wall clock time across all Disk I/O threads...'?


http://gerrit.cloudera.org:8080/#/c/14776/5/be/src/exec/hdfs-scan-node-base.cc@79
PS5, Line 79: concept of \"scan range\", including HDFS and Kudu.
Is this included in profiles where it's irrelevant? If no, remove this to avoid 
confusion. If yes, it'd be better to enumerate the full list since 'scan range' 
is not a user-facing term.


http://gerrit.cloudera.org:8080/#/c/14776/5/be/src/exec/hdfs-scan-node-base.cc@85
PS5, Line 85: disk queue
Maybe '...remote data source...'? Disk queue is not a term users will be 
familiar with. Or, again, just enumerate.
Is HDFS remote read overall a single disk queue, or individual remote read 
targets? Same for S3.


http://gerrit.cloudera.org:8080/#/c/14776/5/be/src/exec/hdfs-scan-node-base.cc@88
PS5, Line 88: Higher values
Than what? - Is there a point of comparison we can provide? For example, 
'values close to the number of disks accessed'?


http://gerrit.cloudera.org:8080/#/c/14776/5/be/src/exec/hdfs-scan-node-base.cc@89
PS5, Line 89: thread
scan


http://gerrit.cloudera.org:8080/#/c/14776/5/be/src/exec/hdfs-scan-node-base.cc@90
PS5, Line 90: because of
I don't think we should speculate on root causes - there can be many reasons 
for this.


http://gerrit.cloudera.org:8080/#/c/14776/5/be/src/exec/hdfs-scan-node-base.cc@121
PS5, Line 121: traditional HDFS scan nodes and the scan "
             :     "node total time for the MT_DOP > 1 scan nodes
Isn't this the same as 'HDFS scans'?


http://gerrit.cloudera.org:8080/#/c/14776/5/be/src/exec/hdfs-scan-node-base.cc@122
PS5, Line 122: Low values show
Don't scanner threads shut down if there's no work to do? IIUC, slow reads 
result in low ScannerIoWaitTime and also low AverageNumScannerThreads. If 
that's correct, I'd just focus on interpreting high values.


http://gerrit.cloudera.org:8080/#/c/14776/5/be/src/exec/hdfs-scan-node-base.cc@125
PS5, Line 125: Note that if CPU load is high, this "
             :     "can include time that
replace: Note that this includes the time when


http://gerrit.cloudera.org:8080/#/c/14776/5/be/src/exec/hdfs-scan-node-base.cc@130
PS5, Line 130: The scan node tracks the number of bytes read for each column it 
"
             :     "processes, and when the scan node is closed, it updates 
these counters with the "
             :     "size of each column
remove?


http://gerrit.cloudera.org:8080/#/c/14776/5/be/src/exec/hdfs-scan-node-base.cc@136
PS5, Line 136: The scan node tracks the number of bytes read for each column it 
"
             :     "processes, and when the scan node is closed, it updates 
these counters with the "
             :     "size of each column.
remove?


http://gerrit.cloudera.org:8080/#/c/14776/5/be/src/exec/scan-node.cc
File be/src/exec/scan-node.cc:

http://gerrit.cloudera.org:8080/#/c/14776/5/be/src/exec/scan-node.cc@57
PS5, Line 57: Provided as a counter as well as a time series that samples the"
            :     " counter
I'd leave this to the BytesReadSeries description.


http://gerrit.cloudera.org:8080/#/c/14776/5/be/src/exec/scan-node.cc@58
PS5, Line 58: Only implemented for scan node subclasses that expose the bytes 
read, "
            :     "e.g. HDFS and HBase
If this refers to the time series version, move there. If the time series is 
not included where it can't be populated, remove entirely.


http://gerrit.cloudera.org:8080/#/c/14776/5/be/src/exec/scan-node.cc@63
PS5, Line 63: UNSTABLE
STABLE_LOW?


http://gerrit.cloudera.org:8080/#/c/14776/5/be/src/exec/scan-node.cc@66
PS5, Line 66: read from disks. If this is the only scan executing, ideally this 
will "
            :     "approach the maximum bandwidth supported by the disks.
This also includes time spent in the scanners, ie. applying predicates, making 
this not directly dependent on read speed. I'd just say '..aggregate rate data 
is scanned' and leave it at that.


http://gerrit.cloudera.org:8080/#/c/14776/5/be/src/exec/scan-node.cc@74
PS5, Line 74:  This is at most the number of scan ranges
More than one scanner thread can be spun up to process data from a single scan 
range, so this counter is not capped at scan range count.


http://gerrit.cloudera.org:8080/#/c/14776/5/be/src/exec/scan-node.cc@85
PS5, Line 85: Wall clock time that the "
            :     "scanner threads
Aggregate wall clock time across all scanner threads [...]


http://gerrit.cloudera.org:8080/#/c/14776/5/be/src/exec/scan-node.cc@94
PS5, Line 94: "This should be created in Open and stopped when all the scanner 
threads are done."
remove?


http://gerrit.cloudera.org:8080/#/c/14776/5/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

http://gerrit.cloudera.org:8080/#/c/14776/5/be/src/runtime/coordinator.cc@88
PS5, Line 88: in exchange nodes
I'd be hand wavy and add '(across the network)' to help interpret this without 
having to understand exchanges.


http://gerrit.cloudera.org:8080/#/c/14776/5/be/src/runtime/coordinator.cc@91
PS5, Line 91: sent
clarify: sent 'across the network' (exchange), or just returned further up the 
tree?


http://gerrit.cloudera.org:8080/#/c/14776/5/be/src/runtime/coordinator.cc@93
PS5, Line 93: sent
clarify, as above


http://gerrit.cloudera.org:8080/#/c/14776/5/be/src/runtime/coordinator.cc@97
PS5, Line 97: i.e. the selectivity over "
            :     "all fragment instances that had a scan node in their plan."
If both counters' descriptions are correct, I think this is not true. A scan on 
the right side of a broadcast join will multiply the bytes returned by the scan 
with the number of hosts for TotalScanByteSent, potentially having a larger 
TotalScanByteSent than the amount of bytes read by the scan.


http://gerrit.cloudera.org:8080/#/c/14776/5/be/src/runtime/coordinator.cc@104
PS5, Line 104: Updated for each Backend in 'UpdateBackendExecStatus' when "
             :     "'ApplyExecStatusReport' returns true.
remove?


http://gerrit.cloudera.org:8080/#/c/14776/5/be/src/runtime/coordinator.cc@105
PS5, Line 105: Only valid after InitBackendStates() is "
             :     "called
Can this be rephrased in less specific terms, e.g. 'after all backends have 
started executing', or something similar?


http://gerrit.cloudera.org:8080/#/c/14776/5/be/src/util/runtime-profile-counters.h
File be/src/util/runtime-profile-counters.h:

http://gerrit.cloudera.org:8080/#/c/14776/5/be/src/util/runtime-profile-counters.h@146
PS5, Line 146: sable
nit: stable


http://gerrit.cloudera.org:8080/#/c/14776/5/www/profile_docs.tmpl
File www/profile_docs.tmpl:

http://gerrit.cloudera.org:8080/#/c/14776/5/www/profile_docs.tmpl@29
PS5, Line 29: <th>Name</th>
            :           <th>Stability</th>
            :           <th>Significance</th>
            :           <th>Unit</th>
            :           <th>Description</th>
            :         </tr>
            :       </thead>
            :       <tbody>
            :         {{#profile_docs}}
            :         <tr>
            :           <td>
            :           {{name}}
            :           </td>
            :           <td>
            :           {{significance}}
            :           </td>
            :           <td>
            :           {{unit}}
            :           </td>
            :           <td>
            :           {{description}}
            :           </td>
Looks like stability and significance needs to be merged in the header. Please 
add descriptions of the stability-significance merged field (expose the 
comments from the enum as a second field?).


http://gerrit.cloudera.org:8080/#/c/14776/5/www/query_profile.tmpl
File www/query_profile.tmpl:

http://gerrit.cloudera.org:8080/#/c/14776/5/www/query_profile.tmpl@40
PS5, Line 40:  <h4>
            :         <a style="font-size:16px;"
            :             href="{{ __common__.host-url }}/profile_docs">Profile 
Documentation
            :         </a>
            :     </h4>
The profile counters should be referenced from outside the context of a single 
profile since they are useful in general and most people don't use the webUI to 
browse profiles. It'd be better in common-header or directly under /queries.



--
To view, visit http://gerrit.cloudera.org:8080/14776
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Idc03faddb27754001290bb6d899840e2cbe7ccb7
Gerrit-Change-Number: 14776
Gerrit-PatchSet: 5
Gerrit-Owner: Jiawei Wang <[email protected]>
Gerrit-Reviewer: Andrew Sherman <[email protected]>
Gerrit-Reviewer: Balazs Jeszenszky <[email protected]>
Gerrit-Reviewer: David Rorke <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Jiawei Wang <[email protected]>
Gerrit-Reviewer: Jiawei Wang <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Comment-Date: Thu, 05 Dec 2019 14:30:49 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-7550: Add documentation to profile counters

Reply via email to