Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/10855 )
Change subject: IMPALA-7213: Port ReportExecStatus() RPC to use KRPC ...................................................................... Patch Set 5: (16 comments) http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/exec/data-sink.h File be/src/exec/data-sink.h: http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/exec/data-sink.h@100 PS5, Line 100: static const string ROOT_PARTITION_KEY; can this be a const char* const ROOT_PARTITION_KEY instead? usually static non-PODs are discouraged -- https://google.github.io/styleguide/cppguide.html#Static_and_Global_Variables http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/exec/hdfs-table-writer.h File be/src/exec/hdfs-table-writer.h: http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/exec/hdfs-table-writer.h@91 PS5, Line 91: InsertStatsPB style: is this meant to be returning a mutable reference or should this be const? if mutable maybe it would be better to return a pointer to align with the style guide? (I see that you just translated the existing code but maybe worth fixing while you are here) http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/runtime/coordinator-backend-state.cc File be/src/runtime/coordinator-backend-state.cc: http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/runtime/coordinator-backend-state.cc@254 PS5, Line 254: for (int i = 0; i < backend_exec_status.instance_exec_status_size(); ++i) { : const FragmentInstanceExecStatusPB& instance_exec_status = : backend_exec_status.instance_exec_status(i); you can use a C++11 style loop with protobuf repeated elements too http://gerrit.cloudera.org:8080/#/c/10855/4/be/src/runtime/query-state.cc File be/src/runtime/query-state.cc: http://gerrit.cloudera.org:8080/#/c/10855/4/be/src/runtime/query-state.cc@274 PS4, Line 274: is is > I don't think this is expected to happen very often but it seems more robus right, I was suggesting DFATAL which crashes impala only in DEBUG builds, but otherwise just emits an ERROR-level log. That way if this starts to get broken we'd get a more obvious failure in debug test runs http://gerrit.cloudera.org:8080/#/c/10855/4/be/src/runtime/query-state.cc@284 PS4, Line 284: ThriftSerializer serializer(true); > The new patch creates a single instance of the RPC parameter and reuse it o yep, that's right. The parameter is serialized out of the PB when you call the function on the proxy, and you're free to do what you want with the protobuf after that. http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/runtime/query-state.cc File be/src/runtime/query-state.cc: http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/runtime/query-state.cc@233 PS5, Line 233: UniqueIdPB* query_id_pb = report->mutable_query_id(); : query_id_pb->set_lo(query_id().lo); : query_id_pb->set_hi(query_id().hi); I thought I saw some utility code for this elsewhere http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/runtime/query-state.cc@243 PS5, Line 243: UniqueIdPB* finstance_id_pb = instance_status->mutable_fragment_instance_id(); : finstance_id_pb->set_lo(fis->instance_id().lo); : finstance_id_pb->set_hi(fis->instance_id().hi); and here http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/runtime/runtime-state.cc File be/src/runtime/runtime-state.cc: http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/runtime/runtime-state.cc@200 PS5, Line 200: exec_status this is weird here because there is no 'exec_status' in this function http://gerrit.cloudera.org:8080/#/c/10855/4/be/src/service/control-service.cc File be/src/service/control-service.cc: http://gerrit.cloudera.org:8080/#/c/10855/4/be/src/service/control-service.cc@60 PS4, Line 60: ControlService > No serious measure on perf yet but yes, there is no point in pushing beyond It seems like Coordinator::BackendState::ApplyExecStatusReport takes a per-query lock. So if you have one query with lots and lots of fragments reporting, threads can stack up waiting on that lock because you're effectively limited to one core worth of throughput doing the merging. Reports for some other concurrent query which might be small (few fragments) can get starved because the big query is tying up all the service threads. There are some fancier options here like queueing the reports instead of acquiring the lock, etc, but maybe in the interim before trying anything fancy it makes sense to have a limit here that is larger than the number of cores? http://gerrit.cloudera.org:8080/#/c/10855/4/be/src/service/control-service.cc@82 PS4, Line 82: Status::Expected(err).ToProto(response->mutable_status()); : // Release the memory against the control service's memory tracker. would it be useful to also print rpc_context->requestor_string in this error so we know which impalad sent the report? http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/service/control-service.cc File be/src/service/control-service.cc: http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/service/control-service.cc@43 PS5, Line 43: control services shouldn't this be "control service's" since there is only one control service? http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/service/control-service.cc@89 PS5, Line 89: DCHECK_EQ(request->instance_exec_status_size(), 1); hmm, why is this repeated instead of just optional, then? Also for robustness I think it's usually safer to check conditions on RPC arguments with an if statement and respoind failure in the case of an incorrect one. Or, just use a CHECK, because in a release build, if this check doesn't pass, you'll likely end up reading or scribbling over some random memory below on line 91 and it'll just be harder to understand what the problem is. http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/service/control-service.cc@117 PS5, Line 117: status.ToProto(response->mutable_status()); : // Release the memory against the control service's memory tracker. : mem_tracker_->Release(rpc_context->GetTransferSize()); : rpc_context->RespondSuccess(); since both of the return points of thsi function have this same code, I think it makes sense to refactor out the body of the function to something like ReportExecStatusInternal or HandleReportExecStatus, which would return a Status. That way if you add a later return point to this function there's less risk of a bug like forgetting to release the memory from the memtracker (which would be tricky to debug if it happened) http://gerrit.cloudera.org:8080/#/c/10855/5/common/protobuf/common.proto File common/protobuf/common.proto: http://gerrit.cloudera.org:8080/#/c/10855/5/common/protobuf/common.proto@43 PS5, Line 43: // The kind of value that a metric represents. do these need to be kept in sync with common/thrift/Metrics.thrift? Worth leaving a note and probably a pointer to any code which is required to translate between them, if we aren't just relying on equal enum int values http://gerrit.cloudera.org:8080/#/c/10855/5/common/protobuf/control_service.proto File common/protobuf/control_service.proto: http://gerrit.cloudera.org:8080/#/c/10855/5/common/protobuf/control_service.proto@153 PS5, Line 153: optional InsertExecStatusPB insert_exec_status = 5; can you document whether this should be sent incrementally as you go (potentially muiltiple times) or whether it should only be sent once when the insert finishes? It seems to me that the handling of it is not idempotent (eg counters are summed rather than replaced). http://gerrit.cloudera.org:8080/#/c/10855/5/common/thrift/ImpalaInternalService.thrift File common/thrift/ImpalaInternalService.thrift: http://gerrit.cloudera.org:8080/#/c/10855/5/common/thrift/ImpalaInternalService.thrift@380 PS5, Line 380: 7: optional Types.TNetworkAddress coord_krpc_address unrelated to this patch but it always bugs me to see changes to thrift field IDs, since it breaks rolling upgrades in very hard-to-understand ways. I know we don't support rolling upgrade but I'd rather see an error message about a missing field than have it attempt to deserialize one field using the data of another -- To view, visit http://gerrit.cloudera.org:8080/10855 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I7638583b433dcac066b87198e448743d90415ebe Gerrit-Change-Number: 10855 Gerrit-PatchSet: 5 Gerrit-Owner: Michael Ho <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Michael Ho <[email protected]> Gerrit-Reviewer: Sailesh Mukil <[email protected]> Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-Comment-Date: Mon, 06 Aug 2018 17:36:15 +0000 Gerrit-HasComments: Yes
