Lars Volker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12069 )

Change subject: IMPALA-7694: Add host resource usage metrics to profile
......................................................................


Patch Set 9:

(25 comments)

Thanks for the review, please see PS9.

http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/runtime/coordinator-backend-state.cc
File be/src/runtime/coordinator-backend-state.cc:

http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/runtime/coordinator-backend-state.cc@57
PS8, Line 57:     const vector<FragmentStats*>& fragment_stats,
> The other places we use resource_profile, it's a variable name for TBackend
Done


http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/runtime/query-state.cc
File be/src/runtime/query-state.cc:

http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/runtime/query-state.cc@142
PS8, Line 142:   ExecEnv* exec_env = ExecEnv::GetInstance();
> Can you move this to line 146 since it's only used in there?
It's used in line 163, too.


http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/runtime/query-state.cc@148
PS8, Line 148:         "CpuUserPercentage", TUnit::BASIS_POINTS, 
[system_state_info] () {
> Should you capture
Done, I don't think it matters much but seems cleaner.


http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/runtime/runtime-state.cc
File be/src/runtime/runtime-state.cc:

http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/runtime/runtime-state.cc@76
PS8, Line 76:     profile_(RuntimeProfile::Create(obj_pool(), "<fragment 
instance>")),
> Is removing the id from the second argument here intentional?
These profiles get merged into others counter-by-counter and are not displayed 
below a parent profile. As such, the name doesn't matter and I picked this one 
to make it somewhat more obvious, similar to what we do elsewhere. Let me know 
if you'd like me to add a clarifying comment.


http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/pretty-printer.h
File be/src/util/pretty-printer.h:

http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/pretty-printer.h@139
PS8, Line 139:       case TUnit::BASIS_POINTS: {
             :         DCHECK_LE(value, 10000);
             :         ss << (value / 100);
             :         break;
             :       }
> What's this end up looking like? Do we want to have some form of double pre
This prints integer values, e.g. CpuUserPercentage (500.000ms): 3, 1

My reasoning was that percentage-granularity would be sufficient for the 
textual profiles. Sub-percent fluctuations would almost certainly be noise and 
clutter the view when glancing at the numbers.


http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/runtime-profile-counters.h
File be/src/util/runtime-profile-counters.h:

http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/runtime-profile-counters.h@34
PS8, Line 34:
> You wrote a really lovely summary of the class structures in runtime-profil
Done


http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/runtime-profile-counters.h@397
PS8, Line 397:
> I don't mind, but we definitely have an occasional pattern of using friend
I felt it was better to keep this here because it deals with accessing the data 
and locking, not with permission (public/private). The alternative (making the 
tests friends with this class) would mean that the test has to call the private 
virtual GetSamplesLocked() and each test would have to explain why it is safe 
to do so from a test. I had seen this pattern in the Kudu code before so I 
thought it fits better. Let me know if you prefer the friend approach or if I'm 
missing something.


http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/runtime-profile-test.cc
File be/src/util/runtime-profile-test.cc:

http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/runtime-profile-test.cc@907
PS8, Line 907:
             :   // Make sure the values are gone.
             :   counter->GetSamplesTest(&n
> This isn't racy because "StopPeriodicCounters()" has been called, right?
Yes, added a comment above L891.


http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/runtime-profile.cc
File be/src/util/runtime-profile.cc:

http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/runtime-profile.cc@780
PS8, Line 780:         // SamplingTimeSeriesCounter.
> If GetSamplesLocked() mutates counter, should we be doing that in a PrettyP
You're right, we shouldn't mark the samples as retrieved here. We always only 
call ToThrift or PrettyPrint on a profile so this is not an issue, but it's not 
clear from the implementation at all. I split up GetSamplesLocked into two 
methods, a const one for printing and a non-const one for send. It's also 
possible to overload them on const/non-const but I find that too confusing.


http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/runtime-profile.cc@784
PS8, Line 784:                << PrettyPrinter::Print(period * 1000000L, 
TUnit::TIME_NS) << "): ";
> If num = 128, we should have step = 2. But, here, we have step = (128-1)/32
Done, thanks for catching this.


http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/runtime-profile.cc@793
PS8, Line 793:         stream << endl;
> nit:
My idea was to make it obvious where to find the whole list of values, i.e. 
that they're still available and had not been discarded during resampling. I'm 
happy to take your suggestion if you think that's clear enough, or any other 
(shorter) option, I'm not too happy with the current version myself but 
couldn't think of a better one either.


http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/runtime-profile.cc@1243
PS8, Line 1243:   values_.push_back(sample);
> I agree that it's unintuitive that a GetSamples() function mutates the stru
Done


http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/runtime-profile.cc@1290
PS8, Line 1290:   TimeSeriesCounter* counter = pool_->Add(new 
ChunkedTimeSeriesCounter(name, unit, fn));
> Are we cool with dynamic_cast in this code base? I found only one or two us
I replaced it by adding the method to the interface.


http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/system-state-info.h
File be/src/util/system-state-info.h:

http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/system-state-info.h@70
PS8, Line 70:   enum PROC_STAT_CPU_VALUES {
            :     CPU_USER,
            :     CPU_NICE,
            :     CPU_SYSTEM,
            :     CPU_IDLE,
            :     CPU_IOWAIT
            :   };
            : 
            :   /// We store the CPU usage values in an array so that
> I assume we have a reason to not do struct CpuValues { int64_t cpu_user } b
I chose an array to iterate over it when reading and summing them up. I tried a 
struct, too, but it seemed too verbose to me. Let me know if you prefer a 
struct.


http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/system-state-info.cc
File be/src/util/system-state-info.cc:

http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/system-state-info.cc@33
PS8, Line 33: SystemStateInfo::SystemStateInfo() : cpu_ratios_ {0} {
> Can this have been
Indeed, TIL. It looks cleaner to me, although I had to search the internet for 
a bit to convince myself that it does initialize all of the members 
(https://stackoverflow.com/questions/10828294/c-and-c-partial-initialization-of-automatic-structure).

I think memset(...) is slightly more idiomatic, but I'm good with the partial 
initialization, too.


http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/system-state-info.cc@88
PS8, Line 88:
> I get what you're doing, but I think
Please see my comment in the header.


http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/system-state-info.cc@95
PS8, Line 95:   // zero (to avoid dividing by zero).
> I don't get this case (or the comment). Is this just:
Yes, updated the comment. We could keep a timer around (and we'd have to parse 
all fields, even the ones we don't know about that might be introduced in later 
kernel versions) and only read the file when 1/USER_HZ seconds have passed. 
However, this is only relevant when calling the function in rapid succession, 
e.g. in a test.


http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/system-state-info.cc@99
PS8, Line 99:   }
> I think these are called "basis points", or at least I couldn't find a base
Done


http://gerrit.cloudera.org:8080/#/c/12069/8/bin/plot_profile_resource_usage.py
File bin/plot_profile_resource_usage.py:

http://gerrit.cloudera.org:8080/#/c/12069/8/bin/plot_profile_resource_usage.py@136
PS8, Line 136:
             :
             :
             :
             :
             :
             :
             :
> Not a blocker, but you could move this and bin/parse-thrift-profile.py both
Done


http://gerrit.cloudera.org:8080/#/c/12069/8/common/thrift/Metrics.thrift
File common/thrift/Metrics.thrift:

http://gerrit.cloudera.org:8080/#/c/12069/8/common/thrift/Metrics.thrift@34
PS8, Line 34:   BASIS_POINTS,
> I think this is BASIS_POINTS.
Done


http://gerrit.cloudera.org:8080/#/c/12069/8/tests/query_test/test_observability.py
File tests/query_test/test_observability.py:

http://gerrit.cloudera.org:8080/#/c/12069/8/tests/query_test/test_observability.py@365
PS8, Line 365:     query = "select count(*), sleep(1000) from 
functional.alltypes"
> query_opts is not used? I think you cut and pasted this from the test below
Done


http://gerrit.cloudera.org:8080/#/c/12069/8/tests/query_test/test_observability.py@368
PS8, Line 368:       # We check for 500ms because a query with 1s duration 
won't hit the 64 values limit.
> I would argue these are "unexpected", and, if these are not expected, maybe
Done


http://gerrit.cloudera.org:8080/#/c/12069/8/tests/query_test/test_observability.py@417
PS8, Line 417:   def 
test_query_profile_host_cpu_usage_resolution_reduction(self):
> Why serial? If you run it in parallel, the fact that it's long is not nearl
Done, this was a leftover from factoring out _get_thrift_profile above. Waiting 
for 300 seconds should be plenty though.


http://gerrit.cloudera.org:8080/#/c/12069/8/tests/query_test/test_observability.py@419
PS8, Line 419:        values, and that they stay below 65 values."""
> This is effectively a unit test for PrettyPrinting the profile, yes? If we
Do you have a particular way of testing it in mind? I found it most convenient 
here because we'd end up picking apart the string result, and doing so in 
python is much easier.


http://gerrit.cloudera.org:8080/#/c/12069/8/tests/query_test/test_observability.py@442
PS8, Line 442:       assert len(string_values) >= 33
> you can also assert len(string_values) >= 33, because if it's 32 or above,
Done



--
To view, visit http://gerrit.cloudera.org:8080/12069
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3aedc20c553ab8d7ed50f72a1a936eba151487d9
Gerrit-Change-Number: 12069
Gerrit-PatchSet: 9
Gerrit-Owner: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Michael Ho <k...@cloudera.com>
Gerrit-Reviewer: Philip Zeyliger <phi...@cloudera.com>
Gerrit-Comment-Date: Fri, 11 Jan 2019 02:24:07 +0000
Gerrit-HasComments: Yes

Reply via email to