Henry Robinson has posted comments on this change.

Change subject: IMPALA-3977: TransmitData() should not block
......................................................................


Patch Set 1:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/5491/1/be/src/runtime/data-stream-mgr.cc
File be/src/runtime/data-stream-mgr.cc:

PS1, Line 53: STREAM_EXPIRATION_TIME_MS
> I've thought about this for a while and it makes sense that the cache is no
I think we can get the same effect as the stream cache without the complexity. 
The key is distinguishing between all the different states: 

If a receiver is not present, return EAGAIN. Either a) this is the first batch, 
so try again or b) we've already successfully sent > 0 batches, so the sender 
knows the receiver has gone away. The case you describe, where there's an 
upstream limit, could lead to a sender retrying if it never contributed a batch 
to the limit. But that's not very common. If we have concerns about the network 
usage of retrying, we can always fall back to probing without a payload.

If the receiver is present, but the queue is full, return a different status 
code to distinguish from the case where the receiver is not present. This 
allows us to fail fast in the former case where the receiver has been torn down 
but we've already sent a batch, so we know it was once there.


http://gerrit.cloudera.org:8080/#/c/5491/1/be/src/runtime/data-stream-sender.cc
File be/src/runtime/data-stream-sender.cc:

PS1, Line 209: TTransmitDataResult res;
> TTransmitDataResult has only one field, i.e. Status.
Don't rely on TTransmitDataResult not changing.

There's no reason L224+ have to be outside the loop.


PS1, Line 217: DoTransmitDataRpc
> I like the idea, but isn't that optimizing for a worse-case scenario? Keepi
Yes, I think the sender needs to time-out of AddBatch(). Otherwise it could get 
blocked in the queue and hold a service pool thread.

I'm ok with waiting to see if the probe request has a benefit, but I still 
think we'll need to distinguish the first RPC from all subsequent ones to 
distinguish between "receiver is not there yet" from "receiver was here and has 
disappeared" conditions (see other comments for details).

I'd expect the probe to take a few ms at most, and the network capacity for 
those should be effectively infinite given our requirements. The capacity for 
full batch sends might be smaller, which affects the latency - I'm not sure we 
can say that throughput is not an issue. Anyhow, you can defer that change for 
now.


-- 
To view, visit http://gerrit.cloudera.org:8080/5491
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I47f9d6c1de95a72ab73471dcd2a3118ef13e7db0
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Sailesh Mukil <[email protected]>
Gerrit-Reviewer: Henry Robinson <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Sailesh Mukil <[email protected]>
Gerrit-HasComments: Yes

Reply via email to