Philip Zeyliger has posted comments on this change. ( http://gerrit.cloudera.org:8080/12069 )
Change subject: IMPALA-7694: Add host resource usage metrics to profile ...................................................................... Patch Set 8: (25 comments) Thanks for the update! I want to do another pass, but I think I've got enough comments that it's useful to click send now. Overall, I think the structure looks great. http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/runtime/coordinator-backend-state.cc File be/src/runtime/coordinator-backend-state.cc: http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/runtime/coordinator-backend-state.cc@57 PS8, Line 57: const vector<FragmentStats*>& fragment_stats, RuntimeProfile* resource_profile, The other places we use resource_profile, it's a variable name for TBackendResourceProfile. I think it's ok with this name, but I had to think about it. I don't have better suggestions, so feel free to ignore. http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/runtime/query-state.cc File be/src/runtime/query-state.cc: http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/runtime/query-state.cc@142 PS8, Line 142: ExecEnv* exec_env = ExecEnv::GetInstance(); Can you move this to line 146 since it's only used in there? http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/runtime/query-state.cc@148 PS8, Line 148: return exec_env->system_state_info()->GetCpuUsageRatios().user; Should you capture exec_env->system_state_info() rather than exec_env? I don't know that it matters. http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/runtime/runtime-state.cc File be/src/runtime/runtime-state.cc: http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/runtime/runtime-state.cc@76 PS8, Line 76: profile_(RuntimeProfile::Create(obj_pool(), "<fragment instance>")), Is removing the id from the second argument here intentional? http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/pretty-printer.h File be/src/util/pretty-printer.h: http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/pretty-printer.h@139 PS8, Line 139: case TUnit::BASE_POINTS: { : DCHECK_LE(value, 10000); : ss << (value / 100); : break; : } What's this end up looking like? Do we want to have some form of double precision here? http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/runtime-profile-counters.h File be/src/util/runtime-profile-counters.h: http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/runtime-profile-counters.h@34 PS8, Line 34: namespace impala { You wrote a really lovely summary of the class structures in runtime-profile.h. Perhaps add a note here to point to that file? http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/runtime-profile-counters.h@397 PS8, Line 397: const int64_t* GetSamplesTest(int* num_samples, int* period) { I don't mind, but we definitely have an occasional pattern of using friend classes to let tests reach into stuff they shouldn't, e.g.: friend class LlvmCodeGenTest_HashTest_Test; http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/runtime-profile-test.cc File be/src/util/runtime-profile-test.cc: http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/runtime-profile-test.cc@907 PS8, Line 907: // Make sure the values are gone. : counter->GetSamplesTest(&num_samples, &period); : EXPECT_EQ(num_samples, 0); This isn't racy because "StopPeriodicCounters()" has been called, right? http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/runtime-profile.cc File be/src/util/runtime-profile.cc: http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/runtime-profile.cc@780 PS8, Line 780: const int64_t* samples = counter->GetSamplesLocked(&num, &period); If GetSamplesLocked() mutates counter, should we be doing that in a PrettyPrint()? http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/runtime-profile.cc@784 PS8, Line 784: int step = std::max((num - 1) / 32, 1); If num = 128, we should have step = 2. But, here, we have step = (128-1)/32 = 3. If I'm right, perhaps: int step = 1; if (num > 64) { step = ceil(num/64.0) // same as? // step = 1 + (num - 1)/64 } http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/runtime-profile.cc@793 PS8, Line 793: stream << " (thrift profile has " << num << " values, showing " << nit: Perhaps simply "showing X of Y values" (elide "thrift profile") http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/runtime-profile.cc@1243 PS8, Line 1243: // Mutable I agree that it's unintuitive that a GetSamples() function mutates the structure. Is that what you're trying to say here? Maybe elaborate on this comment or maybe we can find a better method name. (I don't have any great ideas; feel free to ignore.) http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/runtime-profile.cc@1290 PS8, Line 1290: ChunkedTimeSeriesCounter* c = dynamic_cast<ChunkedTimeSeriesCounter*>(it.second); Are we cool with dynamic_cast in this code base? I found only one or two uses. https://google.github.io/styleguide/cppguide.html#Run-Time_Type_Information__RTTI_ dislikes it. // When you upcast (that is, cast a pointer from type Foo to type // SuperclassOfFoo), it's fine to use implicit_cast<>, since upcasts // always succeed. When you downcast (that is, cast a pointer from // type Foo to type SubclassOfFoo), static_cast<> isn't safe, because // how do you know the pointer is really of type SubclassOfFoo? It // could be a bare Foo, or of type DifferentSubclassOfFoo. Thus, // when you downcast, you should use this macro. In debug mode, we // use dynamic_cast<> to double-check the downcast is legal (we die // if it's not). In normal mode, we do the efficient static_cast<> // instead. Thus, it's important to test in debug mode to make sure // the cast is legal! // This is the only place in the code we should use dynamic_cast<>. // In particular, you SHOULDN'T be using dynamic_cast<> in order to // do RTTI (eg code like this: // if (dynamic_cast<Subclass1>(foo)) HandleASubclass1Object(foo); // if (dynamic_cast<Subclass2>(foo)) HandleASubclass2Object(foo); // You should design the code some other way not to need this. template<typename To, typename From> // use like this: down_cast<T*>(foo); inline To down_cast(From* f) { // so we only accept pointers http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/system-state-info.h File be/src/util/system-state-info.h: http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/system-state-info.h@70 PS8, Line 70: enum PROC_STAT_CPU_FIELDS { : CPU_USER, : CPU_NICE, : CPU_SYSTEM, : CPU_IDLE, : CPU_IOWAIT : }; : : typedef std::array<int64_t, NUM_CPU_FIELDS> CpuValues; I assume we have a reason to not do struct CpuValues { int64_t cpu_user } but rather have this enum driven? If so, let's write it down. If not, maybe a struct is better? http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/system-state-info.cc File be/src/util/system-state-info.cc: http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/system-state-info.cc@33 PS8, Line 33: memset(&cpu_ratios_, 0, sizeof(cpu_ratios_)); Can this have been cpu_ratios_ { 0 } in the constructor initializer list? (Note that I'm far from an expert on this stuff.) http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/system-state-info.cc@88 PS8, Line 88: int64_t cur_sum = accumulate(cur.begin(), cur.end(), 0); I get what you're doing, but I think CpuValues.total() is clearer if you go down the route of making it a struct. http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/system-state-info.cc@95 PS8, Line 95: memset(&cpu_ratios_, 0, sizeof(cpu_ratios_)); I don't get this case (or the comment). Is this just: "If no time has passed, the ratio is zero (to avoid dividing by zero)". Should we have simply not read the value to begin with? http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/system-state-info.cc@99 PS8, Line 99: // Convert each ratio to base points. I think these are called "basis points", or at least I couldn't find a base point reference. http://gerrit.cloudera.org:8080/#/c/12069/8/bin/plot_profile_resource_usage.py File bin/plot_profile_resource_usage.py: http://gerrit.cloudera.org:8080/#/c/12069/8/bin/plot_profile_resource_usage.py@136 PS8, Line 136: if len(space_separated) == 3: : ts = int(space_separated[0]) : print datetime.datetime.fromtimestamp(ts / 1000.0).isoformat(), space_separated[1] : base64_encoded = space_separated[2] : elif len(space_separated) == 1: : base64_encoded = space_separated[0] : else: : raise Exception("Unexpected line: " + line) Not a blocker, but you could move this and bin/parse-thrift-profile.py both into lib/py and share code. Can be done separately. http://gerrit.cloudera.org:8080/#/c/12069/8/common/thrift/Metrics.thrift File common/thrift/Metrics.thrift: http://gerrit.cloudera.org:8080/#/c/12069/8/common/thrift/Metrics.thrift@34 PS8, Line 34: BASE_POINTS, I think this is BASIS_POINTS. https://www.merriam-webster.com/dictionary/basis%20point http://gerrit.cloudera.org:8080/#/c/12069/8/tests/query_test/test_observability.py File tests/query_test/test_observability.py: http://gerrit.cloudera.org:8080/#/c/12069/8/tests/query_test/test_observability.py@365 PS8, Line 365: for query_opts in [None, {'resource_trace_ratio': 1.0}]: query_opts is not used? I think you cut and pasted this from the test below... (Now, setting it to 0.0 and using it makes sense.) http://gerrit.cloudera.org:8080/#/c/12069/8/tests/query_test/test_observability.py@368 PS8, Line 368: expected_strs = ["CpuIoWaitPercentage (500.000ms):", I would argue these are "unexpected", and, if these are not expected, maybe we can drop the duration or regexpify it? http://gerrit.cloudera.org:8080/#/c/12069/8/tests/query_test/test_observability.py@417 PS8, Line 417: @pytest.mark.execute_serially Why serial? If you run it in parallel, the fact that it's long is not nearly as bothersome, because it's implicitly divided by 16 or whatever. http://gerrit.cloudera.org:8080/#/c/12069/8/tests/query_test/test_observability.py@419 PS8, Line 419: """Tests that the resolution of a CPU usage counter increases once they exceed 64 This is effectively a unit test for PrettyPrinting the profile, yes? If we wanted to test that function, which maybe I found a bug in, we could probably do it more explicitly. http://gerrit.cloudera.org:8080/#/c/12069/8/tests/query_test/test_observability.py@442 PS8, Line 442: assert len(string_values) <= 64 you can also assert len(string_values) >= 33, because if it's 32 or above, we shouldn't have sampled down. -- To view, visit http://gerrit.cloudera.org:8080/12069 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I3aedc20c553ab8d7ed50f72a1a936eba151487d9 Gerrit-Change-Number: 12069 Gerrit-PatchSet: 8 Gerrit-Owner: Lars Volker <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Lars Volker <[email protected]> Gerrit-Reviewer: Michael Ho <[email protected]> Gerrit-Reviewer: Philip Zeyliger <[email protected]> Gerrit-Comment-Date: Thu, 10 Jan 2019 17:05:18 +0000 Gerrit-HasComments: Yes
