Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/10415 )
Change subject: IMPALA-6034: Add Cpu and scanned bytes limits per query ...................................................................... Patch Set 1: (29 comments) I have one major concern about the coordinator locking that we need to fix then a bunch of cleanup asks. http://gerrit.cloudera.org:8080/#/c/10415/1/be/src/runtime/coordinator-backend-state.h File be/src/runtime/coordinator-backend-state.h: http://gerrit.cloudera.org:8080/#/c/10415/1/be/src/runtime/coordinator-backend-state.h@115 PS1, Line 115: Cpu Maybe clarify: /// Return total user CPU consumption across all backends. http://gerrit.cloudera.org:8080/#/c/10415/1/be/src/runtime/coordinator-backend-state.h@118 PS1, Line 118: /// Return total sys Cpu. /// Return total system CPU consumption across all backends. http://gerrit.cloudera.org:8080/#/c/10415/1/be/src/runtime/coordinator-backend-state.h@121 PS1, Line 121: /// Return total scanned bytes. /// Return total scanned bytes (from HDFS or HDFS-compatible filesystem) across all backends. http://gerrit.cloudera.org:8080/#/c/10415/1/be/src/runtime/coordinator-backend-state.h@124 PS1, Line 124: void GetBackendResourceUsage(int64_t& user_cpu, int64_t& sys_cpu, int64_t& scanned_bytes, It would be more readable if this returned a struct. 4 output arguments with the same type makes it easy to mess up argument order. http://gerrit.cloudera.org:8080/#/c/10415/1/be/src/runtime/coordinator-backend-state.h@125 PS1, Line 125: int64_t& peak_mem nit: use pointers for output parameters http://gerrit.cloudera.org:8080/#/c/10415/1/be/src/runtime/coordinator-backend-state.h@206 PS1, Line 206: /// total scan ranges complete across all scan nodes maybe add "for this backend" http://gerrit.cloudera.org:8080/#/c/10415/1/be/src/runtime/coordinator-backend-state.h@209 PS1, Line 209: /// total user cpu consumed maybe add "for this backend" http://gerrit.cloudera.org:8080/#/c/10415/1/be/src/runtime/coordinator-backend-state.h@213 PS1, Line 213: int64_t cpu_sys_ = 0; maybe add "for this backend" http://gerrit.cloudera.org:8080/#/c/10415/1/be/src/runtime/coordinator-backend-state.h@281 PS1, Line 281: /// total scan ranges complete across all scan nodes maybe add "across all backends. Updated by AggregateBackendStats()." http://gerrit.cloudera.org:8080/#/c/10415/1/be/src/runtime/coordinator-backend-state.h@284 PS1, Line 284: /// total user cpu consumed maybe add "across all backends. Updated by AggregateBackendStats()." http://gerrit.cloudera.org:8080/#/c/10415/1/be/src/runtime/coordinator-backend-state.h@287 PS1, Line 287: /// total system cpu consumed maybe add "across all backends. Updated by AggregateBackendStats()." http://gerrit.cloudera.org:8080/#/c/10415/1/be/src/runtime/coordinator.h File be/src/runtime/coordinator.h: http://gerrit.cloudera.org:8080/#/c/10415/1/be/src/runtime/coordinator.h@169 PS1, Line 169: int64_t& nit: use pointer for output parameters http://gerrit.cloudera.org:8080/#/c/10415/1/be/src/runtime/coordinator.h@169 PS1, Line 169: void Maybe return a struct instead of multiple output arguments? http://gerrit.cloudera.org:8080/#/c/10415/1/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: http://gerrit.cloudera.org:8080/#/c/10415/1/be/src/runtime/coordinator.cc@721 PS1, Line 721: : int64_t total_cpu =0, total_scanned_bytes =0, backend_user_cpu = 0, : backend_sys_cpu = 0,backend_bytes_read = 0, peak_memory = 0; nit: various whitespace inconsistencies http://gerrit.cloudera.org:8080/#/c/10415/1/be/src/runtime/coordinator.cc@726 PS1, Line 726: nit: don't need some of this vertical whitespace http://gerrit.cloudera.org:8080/#/c/10415/1/be/src/runtime/coordinator.cc@751 PS1, Line 751: query_profile_->AddInfoString("Total CPU time", cpu_total_info.str()); nit: capitalisation is a bit inconsistent - I think everything should be title case, i.e. "Read", "Time" http://gerrit.cloudera.org:8080/#/c/10415/1/be/src/runtime/coordinator.cc@755 PS1, Line 755: query_profile_->AddInfoString("Per Node User time", cpu_user_info.str()); Any reason why we don't show the aggregate user and system CPU time across all backends? I guess people will look at the top-level number then only drill down in rare cases? http://gerrit.cloudera.org:8080/#/c/10415/1/be/src/runtime/coordinator.cc@802 PS1, Line 802: total_user_cpu += backend_state->GetUserCpu(); Should we modify the BackendState functions so that we can get all three values with a single lock acquisition, e.g. just have it return a BackendResourceUtilization struct with all three elements? http://gerrit.cloudera.org:8080/#/c/10415/1/be/src/service/impala-server.h File be/src/service/impala-server.h: http://gerrit.cloudera.org:8080/#/c/10415/1/be/src/service/impala-server.h@1041 PS1, Line 1041: int64_t max_cpu_time_ns; I think we can avoid including these new fields in ExpirationEvent - the expiration thread can just look at the query options directly when it's a RESOURCE_LIMIT event. http://gerrit.cloudera.org:8080/#/c/10415/1/be/src/service/impala-server.cc File be/src/service/impala-server.cc: http://gerrit.cloudera.org:8080/#/c/10415/1/be/src/service/impala-server.cc@1015 PS1, Line 1015: now + (1000L), max_cpu_time_ns, 0, nit: don't needs parents around 1000L http://gerrit.cloudera.org:8080/#/c/10415/1/be/src/service/impala-server.cc@1934 PS1, Line 1934: int64_t cpu_time_ns =0 , scan_bytes =0; nit: weird whitespace http://gerrit.cloudera.org:8080/#/c/10415/1/be/src/service/impala-server.cc@1935 PS1, Line 1935: query_state->coord()->AggregateBackendsResourceUsage(cpu_time_ns, scan_bytes); Ahh... I think I see a problem. It looks like this is acquiring Coordinator::lock_, which can be held for a long time when the coordinator is doing things. We don't want the expiration thread getting stuck in those cases. Dan's coordinator patch that got reverted will remove some of the complications here, but I'll have to do another pass over this to figure out how the locking should work so that this thread cannot get block. This does seem tractable though. http://gerrit.cloudera.org:8080/#/c/10415/1/be/src/service/impala-server.cc@1943 PS1, Line 1943: "Query $0 expired due to cpu limit of $1", Should these messages include the amount of CPU consumed by the query? http://gerrit.cloudera.org:8080/#/c/10415/1/be/src/service/impala-server.cc@1957 PS1, Line 1957: "Query $0 expired due to scan bytes limit of $1", Should these messages include the bytes scanned by the query? http://gerrit.cloudera.org:8080/#/c/10415/1/be/src/service/impala-server.cc@1967 PS1, Line 1967: I think we can get rid of this blank line and following comment. http://gerrit.cloudera.org:8080/#/c/10415/1/be/src/service/query-options-test.cc File be/src/service/query-options-test.cc: http://gerrit.cloudera.org:8080/#/c/10415/1/be/src/service/query-options-test.cc@232 PS1, Line 232: {MAKE_OPTIONDEF(exec_time_limit_s), {0, I32_MAX}}, Add the CPU time option here? http://gerrit.cloudera.org:8080/#/c/10415/1/be/src/service/query-options.cc File be/src/service/query-options.cc: http://gerrit.cloudera.org:8080/#/c/10415/1/be/src/service/query-options.cc@678 PS1, Line 678: LOG(INFO) << "Max Cpu time is : " << max_cpu_time_s << " seconds"; Leftover log message? http://gerrit.cloudera.org:8080/#/c/10415/1/tests/custom_cluster/test_query_expiration.py File tests/custom_cluster/test_query_expiration.py: http://gerrit.cloudera.org:8080/#/c/10415/1/tests/custom_cluster/test_query_expiration.py@168 PS1, Line 168: @CustomClusterTestSuite.with_args("--idle_query_timeout=0") I don't think this actually needs to be a custom cluster test. We can make it a parallel test. I did something similar here but the patch isn't merged yet: https://gerrit.cloudera.org/#/c/10365/4/tests/query_test/test_resource_limits.py http://gerrit.cloudera.org:8080/#/c/10415/1/tests/custom_cluster/test_query_expiration.py@219 PS1, Line 219: sleep(30) We should avoid long sleeps in tests, they make the tests slow and can mask problems. The timeout options are inherently time-based, so we kind-of needed to do that above, but we don't need to do that here. I think we can just poll each of the client handles until they change state, right? With some timeout... -- To view, visit http://gerrit.cloudera.org:8080/10415 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I4c6015e21da684bb9f33e236d71309dd4c178a20 Gerrit-Change-Number: 10415 Gerrit-PatchSet: 1 Gerrit-Owner: Mostafa Mokhtar <mmokh...@cloudera.com> Gerrit-Reviewer: Bikramjeet Vig <bikramjeet....@cloudera.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Comment-Date: Wed, 16 May 2018 19:16:27 +0000 Gerrit-HasComments: Yes