Todd Lipcon has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10855 )

Change subject: IMPALA-7213: Port ReportExecStatus() RPC to use KRPC
......................................................................


Patch Set 5:

(16 comments)

http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/exec/data-sink.h
File be/src/exec/data-sink.h:

http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/exec/data-sink.h@100
PS5, Line 100:   static const string ROOT_PARTITION_KEY;
can this be a const char* const ROOT_PARTITION_KEY instead? usually static 
non-PODs are discouraged -- 
https://google.github.io/styleguide/cppguide.html#Static_and_Global_Variables


http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/exec/hdfs-table-writer.h
File be/src/exec/hdfs-table-writer.h:

http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/exec/hdfs-table-writer.h@91
PS5, Line 91: InsertStatsPB
style: is this meant to be returning a mutable reference or should this be 
const? if mutable maybe it would be better to return a pointer to align with 
the style guide? (I see that you just translated the existing code but maybe 
worth fixing while you are here)


http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/runtime/coordinator-backend-state.cc
File be/src/runtime/coordinator-backend-state.cc:

http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/runtime/coordinator-backend-state.cc@254
PS5, Line 254:   for (int i = 0; i < 
backend_exec_status.instance_exec_status_size(); ++i) {
             :     const FragmentInstanceExecStatusPB& instance_exec_status =
             :         backend_exec_status.instance_exec_status(i);
you can use a C++11 style loop with protobuf repeated elements too


http://gerrit.cloudera.org:8080/#/c/10855/4/be/src/runtime/query-state.cc
File be/src/runtime/query-state.cc:

http://gerrit.cloudera.org:8080/#/c/10855/4/be/src/runtime/query-state.cc@274
PS4, Line 274: is is
> I don't think this is expected to happen very often but it seems more robus
right, I was suggesting DFATAL which crashes impala only in DEBUG builds, but 
otherwise just emits an ERROR-level log. That way if this starts to get broken 
we'd get a more obvious failure in debug test runs


http://gerrit.cloudera.org:8080/#/c/10855/4/be/src/runtime/query-state.cc@284
PS4, Line 284:   ThriftSerializer serializer(true);
> The new patch creates a single instance of the RPC parameter and reuse it o
yep, that's right. The parameter is serialized out of the PB when you call the 
function on the proxy, and you're free to do what you want with the protobuf 
after that.


http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/runtime/query-state.cc
File be/src/runtime/query-state.cc:

http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/runtime/query-state.cc@233
PS5, Line 233:   UniqueIdPB* query_id_pb = report->mutable_query_id();
             :   query_id_pb->set_lo(query_id().lo);
             :   query_id_pb->set_hi(query_id().hi);
I thought I saw some utility code for this elsewhere


http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/runtime/query-state.cc@243
PS5, Line 243:     UniqueIdPB* finstance_id_pb = 
instance_status->mutable_fragment_instance_id();
             :     finstance_id_pb->set_lo(fis->instance_id().lo);
             :     finstance_id_pb->set_hi(fis->instance_id().hi);
and here


http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/runtime/runtime-state.cc
File be/src/runtime/runtime-state.cc:

http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/runtime/runtime-state.cc@200
PS5, Line 200: exec_status
this is weird here because there is no 'exec_status' in this function


http://gerrit.cloudera.org:8080/#/c/10855/4/be/src/service/control-service.cc
File be/src/service/control-service.cc:

http://gerrit.cloudera.org:8080/#/c/10855/4/be/src/service/control-service.cc@60
PS4, Line 60:  ControlService
> No serious measure on perf yet but yes, there is no point in pushing beyond
It seems like Coordinator::BackendState::ApplyExecStatusReport takes a 
per-query lock. So if you have one query with lots and lots of fragments 
reporting, threads can stack up waiting on that lock because you're effectively 
limited to one core worth of throughput doing the merging. Reports for some 
other concurrent query which might be small (few fragments) can get starved 
because the big query is tying up all the service threads.

There are some fancier options here like queueing the reports instead of 
acquiring the lock, etc, but maybe in the interim before trying anything fancy 
it makes sense to have a limit here that is larger than the number of cores?


http://gerrit.cloudera.org:8080/#/c/10855/4/be/src/service/control-service.cc@82
PS4, Line 82:     Status::Expected(err).ToProto(response->mutable_status());
            :     // Release the memory against the control service's memory 
tracker.
would it be useful to also print rpc_context->requestor_string in this error so 
we know which impalad sent the report?


http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/service/control-service.cc
File be/src/service/control-service.cc:

http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/service/control-service.cc@43
PS5, Line 43: control services
shouldn't this be "control service's" since there is only one control service?


http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/service/control-service.cc@89
PS5, Line 89:   DCHECK_EQ(request->instance_exec_status_size(), 1);
hmm, why is this repeated instead of just optional, then?

Also for robustness I think it's usually safer to check conditions on RPC 
arguments with an if statement and respoind failure in the case of an incorrect 
one. Or, just use a CHECK, because in a release build, if this check doesn't 
pass, you'll likely end up reading or scribbling over some random memory below 
on line 91 and it'll just be harder to understand what the problem is.


http://gerrit.cloudera.org:8080/#/c/10855/5/be/src/service/control-service.cc@117
PS5, Line 117:   status.ToProto(response->mutable_status());
             :   // Release the memory against the control service's memory 
tracker.
             :   mem_tracker_->Release(rpc_context->GetTransferSize());
             :   rpc_context->RespondSuccess();
since both of the return points of thsi function have this same code, I think 
it makes sense to refactor out the body of the function to something like 
ReportExecStatusInternal or HandleReportExecStatus, which would return a 
Status. That way if you add a later return point to this function there's less 
risk of a bug like forgetting to release the memory from the memtracker (which 
would be tricky to debug if it happened)


http://gerrit.cloudera.org:8080/#/c/10855/5/common/protobuf/common.proto
File common/protobuf/common.proto:

http://gerrit.cloudera.org:8080/#/c/10855/5/common/protobuf/common.proto@43
PS5, Line 43: // The kind of value that a metric represents.
do these need to be kept in sync with common/thrift/Metrics.thrift? Worth 
leaving a note and probably a pointer to any code which is required to 
translate between them, if we aren't just relying on equal enum int values


http://gerrit.cloudera.org:8080/#/c/10855/5/common/protobuf/control_service.proto
File common/protobuf/control_service.proto:

http://gerrit.cloudera.org:8080/#/c/10855/5/common/protobuf/control_service.proto@153
PS5, Line 153:   optional InsertExecStatusPB insert_exec_status = 5;
can you document whether this should be sent incrementally as you go 
(potentially muiltiple times) or whether it should only be sent once when the 
insert finishes? It seems to me that the handling of it is not idempotent (eg 
counters are summed rather than replaced).


http://gerrit.cloudera.org:8080/#/c/10855/5/common/thrift/ImpalaInternalService.thrift
File common/thrift/ImpalaInternalService.thrift:

http://gerrit.cloudera.org:8080/#/c/10855/5/common/thrift/ImpalaInternalService.thrift@380
PS5, Line 380:   7: optional Types.TNetworkAddress coord_krpc_address
unrelated to this patch but it always bugs me to see changes to thrift field 
IDs, since it breaks rolling upgrades in very hard-to-understand ways. I know 
we don't support rolling upgrade but I'd rather see an error message about a 
missing field than have it attempt to deserialize one field using the data of 
another



--
To view, visit http://gerrit.cloudera.org:8080/10855
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7638583b433dcac066b87198e448743d90415ebe
Gerrit-Change-Number: 10855
Gerrit-PatchSet: 5
Gerrit-Owner: Michael Ho <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Michael Ho <[email protected]>
Gerrit-Reviewer: Sailesh Mukil <[email protected]>
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-Comment-Date: Mon, 06 Aug 2018 17:36:15 +0000
Gerrit-HasComments: Yes

Reply via email to