Philip Zeyliger has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12069 )

Change subject: IMPALA-7694: Add host resource usage metrics to profile
......................................................................


Patch Set 7:

(22 comments)

Thanks! I took a first pass.

If I'm reading it correctly, this implementation will always read /proc/stat 
every 500ms, even if nobody has ever requested that data. Do we want to try to 
short-circuit it?

I personally found the inheritance (two types of TimeSeriesCounter) kind of 
awkward, but that's probably just me. The other variant is to explicitly say as 
an argument whether a TimeSeries is sampling or not (e.g., via "number of 
samples" or something), but I don't think it matters in the grand scheme of 
things.

I found the units (0-255) awkward. I think basis points or percentage points 
would be more natural, but I'm flexible.

Thanks!

http://gerrit.cloudera.org:8080/#/c/12069/7//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12069/7//COMMIT_MSG@7
PS7, Line 7: IMPALA-7694: Add host resource usage metrics to profile
Could you mention that you added a mechanism for periodic counter updates?


http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/runtime/exec-env.cc
File be/src/runtime/exec-env.cc:

http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/runtime/exec-env.cc@463
PS7, Line 463:       std::bind(&SystemStateInfo::CaptureSystemStateSnapshot, 
system_state_info_.get());
I read in effective C++ that I should prefer lambdas to std::bind. Well, I 
really only read the table of contents. I assume it totally doesn't matter 
here, so as long as we're consistent, I'm good with it.


http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/runtime/query-state.cc
File be/src/runtime/query-state.cc:

http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/runtime/query-state.cc@296
PS7, Line 296:   host_profile_->ClearChunkedTimeSeriesCounters();
Do you know what protects this being a race where you're clearing something 
that got added after 293?


http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/service/impala-server.cc
File be/src/service/impala-server.cc:

http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/service/impala-server.cc@1051
PS7, Line 1051:   if (rand() < trace_ratio * (RAND_MAX + 1L)) 
query_ctx->__set_trace_resource_usage(true);
I think rand() isn't thread safe. (And do we initialize it with srand() 
somewhere?) On some tracing code I'm working on, I tried:

+  double p = query_ctx->client_request.query_options.tracing_probability;
+  if (p > 0) {
+    std::mt19937_64 rng;
+    std::uniform_real_distribution<double> unif(0, 1);
+    if (unif(rng) < p) GetThreadDebugInfo()->SetTracingEnabled(true);
+  }


http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/pretty-printer.h
File be/src/util/pretty-printer.h:

http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/pretty-printer.h@139
PS7, Line 139:       case TUnit::BYTE_FRACTION: {
What is a BYTE_FRACTION?


http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/runtime-profile-counters.h
File be/src/util/runtime-profile-counters.h:

http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/runtime-profile-counters.h@453
PS7, Line 453: class RuntimeProfile::ChunkedTimeSeriesCounter
             :     : public RuntimeProfile::TimeSeriesCounter {
The inheritance in this file is getting complicated. Could you add a diagram of 
the tree somewhere in the headers and indicate what the distinctions between 
them all are?


http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/runtime-profile-counters.h@474
PS7, Line 474:
             :   virtual const int64_t* GetSamples(
             :       int* num_samples, int* period, SpinLock** lock = nullptr) 
const override;
Even though this is private, it seems like we should document what lock does 
here.


http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/runtime-profile.h
File be/src/util/runtime-profile.h:

http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/runtime-profile.h@386
PS7, Line 386:       const std::string& name, TUnit::type unit, 
DerivedCounterFunction sample_fn);
What makes sample_fn a "DerivedCounterFunction"? i.e., Derived is unclear to me.


http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/runtime-profile.cc
File be/src/util/runtime-profile.cc:

http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/runtime-profile.cc@779
PS7, Line 779:         // SampliSamplingTimeSeriesCounter.
typo


http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/runtime-profile.cc@788
PS7, Line 788:         stream << endl;
Consider printing something here that indicates that we actually have more 
samples in the thrift profile.

(I think it would also be reasonable to just print everything.)


http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/runtime-profile.cc@1238
PS7, Line 1238:   DCHECK(period_ == period);
Given that this is a per-process flag, this could actually be false. Do we want 
to reject it outright or just accept the sender's version?


http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/system-state-info-test.cc
File be/src/util/system-state-info-test.cc:

http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/system-state-info-test.cc@37
PS7, Line 37: TEST_F(SystemStateInfoTest, ReadProcStat) {
I think I'd expect a test here that takes a string (which represents the value 
of /proc/stat) and tests things more explicitly. If you do something like let 
the filename for /proc/stat be configurable, you can test things exactly.


http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/system-state-info-test.cc@54
PS7, Line 54:     EXPECT_GT(r.user + r.system + r.iowait, 0);
I think you can reasonably expect this to be 200ms / number of CPUs, right?


http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/system-state-info.h
File be/src/util/system-state-info.h:

http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/system-state-info.h@43
PS7, Line 43:   struct CpuUsageRatios {
What does "Ratio" mean here? What are the units of these?


http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/system-state-info.h@64
PS7, Line 64:   enum PROC_STAT_CPU_FIELDS {
            :     CPU_USER,
            :     CPU_NICE,
            :     CPU_SYSTEM,
            :     CPU_IDLE,
            :     CPU_IOWAIT
            :   };
I think there could be more fields here (from man proc).


       /proc/stat
              kernel/system statistics.  Varies with architecture.  Common 
entries include:

              cpu  3357 0 4313 1362393
                     The amount of time, measured in units of USER_HZ (1/100ths 
of a second on most architectures, use sysconf(_SC_CLK_TCK) to obtain the right 
value), that the system spent in various states:

                     user   (1) Time spent in user mode.

                     nice   (2) Time spent in user mode with low priority 
(nice).

                     system (3) Time spent in system mode.

                     idle   (4) Time spent in the idle task.  This value should 
be USER_HZ times the second entry in the /proc/uptime pseudo-file.

                     iowait (since Linux 2.5.41)
                            (5) Time waiting for I/O to complete.

                     irq (since Linux 2.6.0-test4)
                            (6) Time servicing interrupts.

                     softirq (since Linux 2.6.0-test4)
                            (7) Time servicing softirqs.

                     steal (since Linux 2.6.11)
                            (8) Stolen time, which is the time spent in other 
operating systems when running in a virtualized environment

                     guest (since Linux 2.6.24)
                            (9) Time spent running a virtual CPU for guest 
operating systems under the control of the Linux kernel.

                     guest_nice (since Linux 2.6.33)
                            (10) Time spent running a niced guest (virtual CPU 
for guest operating systems under the control of the Linux kernel).


http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/system-state-info.cc
File be/src/util/system-state-info.cc:

http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/system-state-info.cc@33
PS7, Line 33:   memset(&cpu_ratios_, 0, sizeof(cpu_ratios_));
I'm not an expert on this stuff, but can we use array initialization here?


http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/system-state-info.cc@65
PS7, Line 65:   for (int i = 0; i < NUM_CPU_FIELDS; ++i) proc_file >> 
next_values[i];
Is there any error handling that needs to happen here?

>From a unit testing perspective, I'd be more comfortable if we explicitly 
>tested that line /proc/stat getting into CpuValues.


http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/system-state-info.cc@75
PS7, Line 75:   if (total_tics == 0) {
Comment?


http://gerrit.cloudera.org:8080/#/c/12069/7/be/src/util/system-state-info.cc@80
PS7, Line 80:   cpu_ratios_.user = ((cur[CPU_USER] - old[CPU_USER]) << 8) / 
total_tics;
            :   cpu_ratios_.system = ((cur[CPU_SYSTEM] - old[CPU_SYSTEM]) << 8) 
/ total_tics;
            :   cpu_ratios_.iowait = ((cur[CPU_IOWAIT] - old[CPU_IOWAIT]) << 8) 
/ total_tics;
Comment what this is doing?


http://gerrit.cloudera.org:8080/#/c/12069/7/common/thrift/Metrics.thrift
File common/thrift/Metrics.thrift:

http://gerrit.cloudera.org:8080/#/c/12069/7/common/thrift/Metrics.thrift@32
PS7, Line 32:   // Fraction between 0 and 1 with single byte precision [0, 255].
            :   // Pretty-printed as a percentage [0, 100].
            :   BYTE_FRACTION,
This is a weird unit. I guess we're constrained to int64 here, so that's how 
this came about?

Basis points are real units (hundredths of one percent). So are PERCENTS (with 
values 0-100).


http://gerrit.cloudera.org:8080/#/c/12069/7/common/thrift/RuntimeProfile.thrift
File common/thrift/RuntimeProfile.thrift:

http://gerrit.cloudera.org:8080/#/c/12069/7/common/thrift/RuntimeProfile.thrift@73
PS7, Line 73:   5: required i64 start_index
By marking this required, I think you make Thrift fail to understand an old one 
of these. Since we maybe try to keep the profiles compatible, it may be wiser 
to mark this optional.


http://gerrit.cloudera.org:8080/#/c/12069/7/tests/query_test/test_observability.py
File tests/query_test/test_observability.py:

http://gerrit.cloudera.org:8080/#/c/12069/7/tests/query_test/test_observability.py@422
PS7, Line 422:     query_opts = {'resource_trace_ratio': 1.0}
Do we have a test that asserts that resource_trace_ratio = 0 also works 
correctly?



--
To view, visit http://gerrit.cloudera.org:8080/12069
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3aedc20c553ab8d7ed50f72a1a936eba151487d9
Gerrit-Change-Number: 12069
Gerrit-PatchSet: 7
Gerrit-Owner: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Michael Ho <k...@cloudera.com>
Gerrit-Reviewer: Philip Zeyliger <phi...@cloudera.com>
Gerrit-Comment-Date: Fri, 14 Dec 2018 22:17:57 +0000
Gerrit-HasComments: Yes

Reply via email to