Michael Ho has posted comments on this change. ( http://gerrit.cloudera.org:8080/9690 )
Change subject: IMPALA-6685: Improve profiles in KrpcDataStreamRecvr and KrpcDataStreamSender ...................................................................... Patch Set 4: (16 comments) http://gerrit.cloudera.org:8080/#/c/9690/3/be/src/runtime/krpc-data-stream-mgr.cc File be/src/runtime/krpc-data-stream-mgr.cc: http://gerrit.cloudera.org:8080/#/c/9690/3/be/src/runtime/krpc-data-stream-mgr.cc@201 PS3, Line 201: TUniqueId finst_id; > its impossible to know what this is all about until you read the other code Done http://gerrit.cloudera.org:8080/#/c/9690/3/be/src/runtime/krpc-data-stream-recvr.h File be/src/runtime/krpc-data-stream-recvr.h: http://gerrit.cloudera.org:8080/#/c/9690/3/be/src/runtime/krpc-data-stream-recvr.h@171 PS3, Line 171: num_deferred_rpcs() > should this be num_deferred_rpcs? We've been referring to these as "deferre Done http://gerrit.cloudera.org:8080/#/c/9690/3/be/src/runtime/krpc-data-stream-recvr.h@245 PS3, Line 245: Following counters belong to 'dequeue_profile > is that accurate? what does "deferred row batches" have to do with sampling Oops.. copied and pasted from elsewhere. http://gerrit.cloudera.org:8080/#/c/9690/3/be/src/runtime/krpc-data-stream-recvr.h@254 PS3, Line 254: RuntimeProfile::Counter* queue_get_batch_time_; > then it should be in this section, right? Done http://gerrit.cloudera.org:8080/#/c/9690/3/be/src/runtime/krpc-data-stream-recvr.h@255 PS3, Line 255: > then why have data_wait_time_? Inactive time is not shown in the profile. It's only used for computing the "non-child" time of the exchange node. I find "non-child" time inaccurate but I will refrain from fixing it in this change. http://gerrit.cloudera.org:8080/#/c/9690/3/be/src/runtime/krpc-data-stream-recvr.h@262 PS3, Line 262: /// ------------------------------------------------------------------------------------ : /// Following counters belong to 'enqueue_pro > why is this in the "dequeue" profile? seems more enqueue side, no? Done http://gerrit.cloudera.org:8080/#/c/9690/3/be/src/runtime/krpc-data-stream-recvr.h@294 PS3, Line 294: RuntimeProfile::Counter* total_deferred_rpcs_time_; This is converted to tracking the cumulative time in which 'deferred_rpcs_' is not empty in the new patch as the average tends to be skewed outlier (e.g. the amount of time between when a RPC arrives and when the parent of exchange node actually calls GetBatch()). http://gerrit.cloudera.org:8080/#/c/9690/3/be/src/runtime/krpc-data-stream-recvr.cc File be/src/runtime/krpc-data-stream-recvr.cc: http://gerrit.cloudera.org:8080/#/c/9690/3/be/src/runtime/krpc-data-stream-recvr.cc@212 PS3, Line 212: parent_recvr), num_remaining_senders_(num_senders) { > nit: would be nice to order these in the same order of the comment Done http://gerrit.cloudera.org:8080/#/c/9690/3/be/src/runtime/krpc-data-stream-recvr.cc@457 PS3, Line 457: DataStreamService::RespondAndReleaseRpc(status, ctx->response, ctx->rpc_context, : recvr_->deferred_rpc_tracker()); : DequeueDeferredRpc(); : return; : } : : // Stops if inserting the batch causes us to go over the limit. : // Put 'ctx' back on the queue. : if (!CanEnqueue(batch_size)) { : ctx.swap(deferred_rpcs_.front()); : DCHECK(deferred_rpcs_.front().get() != nullptr); : return; : } : > why not use UnpackRequest() for this? Convert to a static function. UnpackRequest() was returning deserialized size while this returns the serialized size. http://gerrit.cloudera.org:8080/#/c/9690/3/be/src/runtime/krpc-data-stream-sender.cc File be/src/runtime/krpc-data-stream-sender.cc: http://gerrit.cloudera.org:8080/#/c/9690/3/be/src/runtime/krpc-data-stream-sender.cc@388 PS3, Line 388: _in_flight_batch_) > but that doesn't include time spent in the krpc service queue, right? so ho Updated the patch to also include the time in service queue. There is currently no easy way to track time spent in the outbound transfer queue. http://gerrit.cloudera.org:8080/#/c/9690/3/be/src/runtime/krpc-data-stream-sender.cc@389 PS3, Line 389: int64_t network_time = total_time - resp_.receiver_latency_ns(); > why is that? why does total time not include time spent waiting for the rep In the Thrift based implementation, we are mostly tracking the time to cache the send buffer and the serialization time. PS4 fixes this up by counting the total time as time spent in Send(), FlushFinal(), Close(), Open() and Prepare(). Also counted the time spent in WaitForRpc() as "inactive time" so as to show the actual overhead of DSS. http://gerrit.cloudera.org:8080/#/c/9690/3/be/src/runtime/runtime-state.h File be/src/runtime/runtime-state.h: http://gerrit.cloudera.org:8080/#/c/9690/3/be/src/runtime/runtime-state.h@a356 PS3, Line 356: > is the "across all threads" part true? maybe include that in the new commen "across all threads" is not valid anymore.There is only one thread: the fragment instance execution threads. I believe it used to be true with Thrift when there is one thread per channel. http://gerrit.cloudera.org:8080/#/c/9690/3/be/src/service/data-stream-service.cc File be/src/service/data-stream-service.cc: http://gerrit.cloudera.org:8080/#/c/9690/3/be/src/service/data-stream-service.cc@80 PS3, Line 80: void DataStreamService::TransmitData(const TransmitDataRequestPB* request, > I think it would be clearer to set the initial value of set_queue_time() he With the new patch, we will now use the time a RPC is received recorded in an inbound call. http://gerrit.cloudera.org:8080/#/c/9690/3/be/src/service/data-stream-service.cc@88 PS3, Line 88: void DataStreamService::RespondRpc(const Status& status, : ResponsePBType* response, kudu::rpc::RpcContext* ctx) { : MonoDelta duration(Mon > could just call RespondRpc(), no? Done http://gerrit.cloudera.org:8080/#/c/9690/3/common/protobuf/data_stream_service.proto File common/protobuf/data_stream_service.proto: http://gerrit.cloudera.org:8080/#/c/9690/3/common/protobuf/data_stream_service.proto@53 PS3, Line 53: // Latency for response in the receiver in nanoseconds. > what is "queue time"? Done http://gerrit.cloudera.org:8080/#/c/9690/3/common/protobuf/data_stream_service.proto@53 PS3, Line 53: // Latency for response in the receiver in nanoseconds. > After reading through the code, "queue time" seems confusing because there Done -- To view, visit http://gerrit.cloudera.org:8080/9690 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I8ba405921b3df920c1e85b940ce9c8d02fc647cd Gerrit-Change-Number: 9690 Gerrit-PatchSet: 4 Gerrit-Owner: Michael Ho <k...@cloudera.com> Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com> Gerrit-Reviewer: Lars Volker <l...@cloudera.com> Gerrit-Reviewer: Michael Ho <k...@cloudera.com> Gerrit-Reviewer: Mostafa Mokhtar <mmokh...@cloudera.com> Gerrit-Comment-Date: Sat, 24 Mar 2018 22:54:39 +0000 Gerrit-HasComments: Yes