Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/12069 )
Change subject: IMPALA-7694: Add host resource usage metrics to profile ...................................................................... Patch Set 9: (25 comments) Thanks for the review, please see PS9. http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/runtime/coordinator-backend-state.cc File be/src/runtime/coordinator-backend-state.cc: http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/runtime/coordinator-backend-state.cc@57 PS8, Line 57: const vector<FragmentStats*>& fragment_stats, > The other places we use resource_profile, it's a variable name for TBackend Done http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/runtime/query-state.cc File be/src/runtime/query-state.cc: http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/runtime/query-state.cc@142 PS8, Line 142: ExecEnv* exec_env = ExecEnv::GetInstance(); > Can you move this to line 146 since it's only used in there? It's used in line 163, too. http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/runtime/query-state.cc@148 PS8, Line 148: "CpuUserPercentage", TUnit::BASIS_POINTS, [system_state_info] () { > Should you capture Done, I don't think it matters much but seems cleaner. http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/runtime/runtime-state.cc File be/src/runtime/runtime-state.cc: http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/runtime/runtime-state.cc@76 PS8, Line 76: profile_(RuntimeProfile::Create(obj_pool(), "<fragment instance>")), > Is removing the id from the second argument here intentional? These profiles get merged into others counter-by-counter and are not displayed below a parent profile. As such, the name doesn't matter and I picked this one to make it somewhat more obvious, similar to what we do elsewhere. Let me know if you'd like me to add a clarifying comment. http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/pretty-printer.h File be/src/util/pretty-printer.h: http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/pretty-printer.h@139 PS8, Line 139: case TUnit::BASIS_POINTS: { : DCHECK_LE(value, 10000); : ss << (value / 100); : break; : } > What's this end up looking like? Do we want to have some form of double pre This prints integer values, e.g. CpuUserPercentage (500.000ms): 3, 1 My reasoning was that percentage-granularity would be sufficient for the textual profiles. Sub-percent fluctuations would almost certainly be noise and clutter the view when glancing at the numbers. http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/runtime-profile-counters.h File be/src/util/runtime-profile-counters.h: http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/runtime-profile-counters.h@34 PS8, Line 34: > You wrote a really lovely summary of the class structures in runtime-profil Done http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/runtime-profile-counters.h@397 PS8, Line 397: > I don't mind, but we definitely have an occasional pattern of using friend I felt it was better to keep this here because it deals with accessing the data and locking, not with permission (public/private). The alternative (making the tests friends with this class) would mean that the test has to call the private virtual GetSamplesLocked() and each test would have to explain why it is safe to do so from a test. I had seen this pattern in the Kudu code before so I thought it fits better. Let me know if you prefer the friend approach or if I'm missing something. http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/runtime-profile-test.cc File be/src/util/runtime-profile-test.cc: http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/runtime-profile-test.cc@907 PS8, Line 907: : // Make sure the values are gone. : counter->GetSamplesTest(&n > This isn't racy because "StopPeriodicCounters()" has been called, right? Yes, added a comment above L891. http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/runtime-profile.cc File be/src/util/runtime-profile.cc: http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/runtime-profile.cc@780 PS8, Line 780: // SamplingTimeSeriesCounter. > If GetSamplesLocked() mutates counter, should we be doing that in a PrettyP You're right, we shouldn't mark the samples as retrieved here. We always only call ToThrift or PrettyPrint on a profile so this is not an issue, but it's not clear from the implementation at all. I split up GetSamplesLocked into two methods, a const one for printing and a non-const one for send. It's also possible to overload them on const/non-const but I find that too confusing. http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/runtime-profile.cc@784 PS8, Line 784: << PrettyPrinter::Print(period * 1000000L, TUnit::TIME_NS) << "): "; > If num = 128, we should have step = 2. But, here, we have step = (128-1)/32 Done, thanks for catching this. http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/runtime-profile.cc@793 PS8, Line 793: stream << endl; > nit: My idea was to make it obvious where to find the whole list of values, i.e. that they're still available and had not been discarded during resampling. I'm happy to take your suggestion if you think that's clear enough, or any other (shorter) option, I'm not too happy with the current version myself but couldn't think of a better one either. http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/runtime-profile.cc@1243 PS8, Line 1243: values_.push_back(sample); > I agree that it's unintuitive that a GetSamples() function mutates the stru Done http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/runtime-profile.cc@1290 PS8, Line 1290: TimeSeriesCounter* counter = pool_->Add(new ChunkedTimeSeriesCounter(name, unit, fn)); > Are we cool with dynamic_cast in this code base? I found only one or two us I replaced it by adding the method to the interface. http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/system-state-info.h File be/src/util/system-state-info.h: http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/system-state-info.h@70 PS8, Line 70: enum PROC_STAT_CPU_VALUES { : CPU_USER, : CPU_NICE, : CPU_SYSTEM, : CPU_IDLE, : CPU_IOWAIT : }; : : /// We store the CPU usage values in an array so that > I assume we have a reason to not do struct CpuValues { int64_t cpu_user } b I chose an array to iterate over it when reading and summing them up. I tried a struct, too, but it seemed too verbose to me. Let me know if you prefer a struct. http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/system-state-info.cc File be/src/util/system-state-info.cc: http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/system-state-info.cc@33 PS8, Line 33: SystemStateInfo::SystemStateInfo() : cpu_ratios_ {0} { > Can this have been Indeed, TIL. It looks cleaner to me, although I had to search the internet for a bit to convince myself that it does initialize all of the members (https://stackoverflow.com/questions/10828294/c-and-c-partial-initialization-of-automatic-structure). I think memset(...) is slightly more idiomatic, but I'm good with the partial initialization, too. http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/system-state-info.cc@88 PS8, Line 88: > I get what you're doing, but I think Please see my comment in the header. http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/system-state-info.cc@95 PS8, Line 95: // zero (to avoid dividing by zero). > I don't get this case (or the comment). Is this just: Yes, updated the comment. We could keep a timer around (and we'd have to parse all fields, even the ones we don't know about that might be introduced in later kernel versions) and only read the file when 1/USER_HZ seconds have passed. However, this is only relevant when calling the function in rapid succession, e.g. in a test. http://gerrit.cloudera.org:8080/#/c/12069/8/be/src/util/system-state-info.cc@99 PS8, Line 99: } > I think these are called "basis points", or at least I couldn't find a base Done http://gerrit.cloudera.org:8080/#/c/12069/8/bin/plot_profile_resource_usage.py File bin/plot_profile_resource_usage.py: http://gerrit.cloudera.org:8080/#/c/12069/8/bin/plot_profile_resource_usage.py@136 PS8, Line 136: : : : : : : : > Not a blocker, but you could move this and bin/parse-thrift-profile.py both Done http://gerrit.cloudera.org:8080/#/c/12069/8/common/thrift/Metrics.thrift File common/thrift/Metrics.thrift: http://gerrit.cloudera.org:8080/#/c/12069/8/common/thrift/Metrics.thrift@34 PS8, Line 34: BASIS_POINTS, > I think this is BASIS_POINTS. Done http://gerrit.cloudera.org:8080/#/c/12069/8/tests/query_test/test_observability.py File tests/query_test/test_observability.py: http://gerrit.cloudera.org:8080/#/c/12069/8/tests/query_test/test_observability.py@365 PS8, Line 365: query = "select count(*), sleep(1000) from functional.alltypes" > query_opts is not used? I think you cut and pasted this from the test below Done http://gerrit.cloudera.org:8080/#/c/12069/8/tests/query_test/test_observability.py@368 PS8, Line 368: # We check for 500ms because a query with 1s duration won't hit the 64 values limit. > I would argue these are "unexpected", and, if these are not expected, maybe Done http://gerrit.cloudera.org:8080/#/c/12069/8/tests/query_test/test_observability.py@417 PS8, Line 417: def test_query_profile_host_cpu_usage_resolution_reduction(self): > Why serial? If you run it in parallel, the fact that it's long is not nearl Done, this was a leftover from factoring out _get_thrift_profile above. Waiting for 300 seconds should be plenty though. http://gerrit.cloudera.org:8080/#/c/12069/8/tests/query_test/test_observability.py@419 PS8, Line 419: values, and that they stay below 65 values.""" > This is effectively a unit test for PrettyPrinting the profile, yes? If we Do you have a particular way of testing it in mind? I found it most convenient here because we'd end up picking apart the string result, and doing so in python is much easier. http://gerrit.cloudera.org:8080/#/c/12069/8/tests/query_test/test_observability.py@442 PS8, Line 442: assert len(string_values) >= 33 > you can also assert len(string_values) >= 33, because if it's 32 or above, Done -- To view, visit http://gerrit.cloudera.org:8080/12069 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I3aedc20c553ab8d7ed50f72a1a936eba151487d9 Gerrit-Change-Number: 12069 Gerrit-PatchSet: 9 Gerrit-Owner: Lars Volker <l...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Lars Volker <l...@cloudera.com> Gerrit-Reviewer: Michael Ho <k...@cloudera.com> Gerrit-Reviewer: Philip Zeyliger <phi...@cloudera.com> Gerrit-Comment-Date: Fri, 11 Jan 2019 02:24:07 +0000 Gerrit-HasComments: Yes