Henry Robinson has posted comments on this change. Change subject: IMPALA-3977: TransmitData() should not block ......................................................................
Patch Set 1: (3 comments) http://gerrit.cloudera.org:8080/#/c/5491/1/be/src/runtime/data-stream-mgr.cc File be/src/runtime/data-stream-mgr.cc: PS1, Line 53: STREAM_EXPIRATION_TIME_MS > I've thought about this for a while and it makes sense that the cache is no I think we can get the same effect as the stream cache without the complexity. The key is distinguishing between all the different states: If a receiver is not present, return EAGAIN. Either a) this is the first batch, so try again or b) we've already successfully sent > 0 batches, so the sender knows the receiver has gone away. The case you describe, where there's an upstream limit, could lead to a sender retrying if it never contributed a batch to the limit. But that's not very common. If we have concerns about the network usage of retrying, we can always fall back to probing without a payload. If the receiver is present, but the queue is full, return a different status code to distinguish from the case where the receiver is not present. This allows us to fail fast in the former case where the receiver has been torn down but we've already sent a batch, so we know it was once there. http://gerrit.cloudera.org:8080/#/c/5491/1/be/src/runtime/data-stream-sender.cc File be/src/runtime/data-stream-sender.cc: PS1, Line 209: TTransmitDataResult res; > TTransmitDataResult has only one field, i.e. Status. Don't rely on TTransmitDataResult not changing. There's no reason L224+ have to be outside the loop. PS1, Line 217: DoTransmitDataRpc > I like the idea, but isn't that optimizing for a worse-case scenario? Keepi Yes, I think the sender needs to time-out of AddBatch(). Otherwise it could get blocked in the queue and hold a service pool thread. I'm ok with waiting to see if the probe request has a benefit, but I still think we'll need to distinguish the first RPC from all subsequent ones to distinguish between "receiver is not there yet" from "receiver was here and has disappeared" conditions (see other comments for details). I'd expect the probe to take a few ms at most, and the network capacity for those should be effectively infinite given our requirements. The capacity for full batch sends might be smaller, which affects the latency - I'm not sure we can say that throughput is not an issue. Anyhow, you can defer that change for now. -- To view, visit http://gerrit.cloudera.org:8080/5491 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I47f9d6c1de95a72ab73471dcd2a3118ef13e7db0 Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Sailesh Mukil <[email protected]> Gerrit-Reviewer: Henry Robinson <[email protected]> Gerrit-Reviewer: Lars Volker <[email protected]> Gerrit-Reviewer: Sailesh Mukil <[email protected]> Gerrit-HasComments: Yes
