[ 
https://issues.apache.org/jira/browse/IMPALA-6818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520935#comment-16520935
 ] 

Michael Ho commented on IMPALA-6818:
------------------------------------

[~dhecht] and I discussed about an alternative which may allow us to get rid of 
this timeout option.

When a sender issues a {{TransmitData()}} RPC before the receiver is created, 
the data stream manager should just reply to the sender with an error message 
that it's not found. The sender will then keep pinging the destination fragment 
periodically until the receiver shows up or until the sender's fragment is 
cancelled. There are some races which may still occur:

- The receiver is closed before the sender manages to send the first row batch 
to the receiver. In which case, the sender will keep pinging the destination 
fragment until the sender fragment is cancelled. The assumption is that the 
receiver will only be closed prematurely before receiving all the data from the 
sender when (1) the query is cancelled or (2) the ancestor executors of the 
exchange node has a limit and closes the exchange node once the limit is 
reached. In case (2), it's expected that the report status thread of the 
sender's fragment will notice that all results have been returned and cancel 
the sender fragment (I believe this may not yet be the case until IMPALA-5119 
is fixed. [~dhecht], please correct me if I misunderstood anything).

> Rethink data-stream sender/receiver startup sequencing
> ------------------------------------------------------
>
>                 Key: IMPALA-6818
>                 URL: https://issues.apache.org/jira/browse/IMPALA-6818
>             Project: IMPALA
>          Issue Type: Sub-task
>          Components: Distributed Exec
>            Reporter: Dan Hecht
>            Assignee: Michael Ho
>            Priority: Major
>
> IMPALA-1599 introduced parallel fragment startup, which is good for startup 
> latency. However, it meant that data-stream senders can start before 
> receivers, and there is a timeout to handle the case when the receiver never 
> shows up:
> {code:java}
> Sender timed out waiting for receiver fragment instance{code}
> We see this timeout fairly regularly (e.g. when a host has a spike in load 
> and does not process the exec rpc for a while). Let's rethink how this works 
> to see if we can make it robust but being careful to not sacrifice startup 
> time too much.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to