[ 
https://issues.apache.org/jira/browse/DRILL-3845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14987825#comment-14987825
 ] 

Deneche A. Hakim commented on DRILL-3845:
-----------------------------------------

The patch fixes the partition sender so that it no longer sends the "last 
batch" to receivers that have already finished, but the real problem seems to 
be in the unordered receiver.

Looking at the senders and receivers, the assumption is that when a receiving 
fragment finishes (e.g. limit cancellation) the receiver sends a "receiver 
finished" message to its sender(s), but still waits for a "last batch" message 
before closing.

The unordered receiver doesn't wait for the "last batch" message. Most of the 
time this is fine because the rpc layer (data server) gracefully handles 
batches that are sent to closed receivers, but in the case of DRILL-2274 the 
"last batch" is sent more than 10 minutes after the receiving fragment closed. 
This causes a "Data not accepted downstream" error because the data server 
can no longer find a receiving fragment (we have a 10-minute cache that keeps 
recently finished fragments).
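
To make the timing issue concrete, here is a minimal sketch of a 
"recently finished" lookup with a 10-minute window. It is not Drill's actual 
DataServer or WorkEventBus code: the class name, method names and the use of 
IllegalStateException are all assumptions made for illustration. Only the 
10-minute window and the error text come from the description above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; not the real Drill classes.
final class RecentlyFinishedSketch {
    // 10-minute window, as described in the comment.
    static final long TTL_MILLIS = 10 * 60 * 1000L;

    // fragment id -> time (in millis) at which the fragment finished
    private final Map<String, Long> finishedAt = new HashMap<>();

    void markFinished(String fragmentId, long nowMillis) {
        finishedAt.put(fragmentId, nowMillis);
    }

    /**
     * Handles a batch that arrives for a fragment that is no longer running.
     * Returns true if the batch is silently dropped (fragment still remembered),
     * and throws if the fragment has aged out of the window or was never seen.
     */
    boolean dropLateBatch(String fragmentId, long nowMillis) {
        Long finished = finishedAt.get(fragmentId);
        if (finished != null && nowMillis - finished <= TTL_MILLIS) {
            return true; // receiver finished recently: drop the batch quietly
        }
        finishedAt.remove(fragmentId); // expired entries are forgotten
        throw new IllegalStateException("Data not accepted downstream");
    }
}
```

In this sketch a batch arriving 5 minutes after the fragment finished is 
dropped quietly, while one arriving 11 minutes later fails.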

A proper fix is to make sure the unordered receiver waits for the "last 
batch" before closing.
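
The intended handshake can be sketched as a small state machine. This is only 
an illustration under assumed names (ReceiverSketch, finish, onBatch, 
canClose are hypothetical and do not correspond to Drill's receiver API): 
after signalling "receiver finished" the receiver discards incoming data but 
stays open until the sender's "last batch" arrives.

```java
// Hypothetical sketch of the receiver-side handshake; not Drill's actual code.
final class ReceiverSketch {
    private boolean finishedSent = false;   // "receiver finished" was sent upstream
    private boolean lastBatchSeen = false;  // sender's "last batch" has arrived

    /** Downstream (e.g. a limit) needs no more data: notify the sender(s). */
    void finish() {
        finishedSent = true;
    }

    /**
     * Called for every incoming batch.
     * Returns true if the batch should be passed downstream,
     * false if it is discarded because the receiver already finished.
     */
    boolean onBatch(boolean isLastBatch) {
        if (isLastBatch) {
            lastBatchSeen = true;
        }
        return !finishedSent;
    }

    /** The receiver may close only once the "last batch" has been seen. */
    boolean canClose() {
        return lastBatchSeen;
    }
}
```

With this shape, a receiver that calls finish() early keeps reporting 
canClose() as false until the sender's final batch is delivered, so the sender 
never targets a fragment that has already gone away.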

> Partition sender shouldn't send the "last batch" to a receiver that sent a 
> "receiver finished" to the sender
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: DRILL-3845
>                 URL: https://issues.apache.org/jira/browse/DRILL-3845
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Relational Operators
>            Reporter: Deneche A. Hakim
>            Assignee: Deneche A. Hakim
>             Fix For: 1.3.0
>
>
> Even if a receiver has finished and informed the corresponding partition 
> sender, the sender will still try to send a "last batch" to the receiver when 
> it's done. In most cases this is fine as those batches will be silently 
> dropped by the receiving DataServer, but if a receiver finished more than 10 
> minutes ago, DataServer will throw an exception because it can't find the 
> corresponding FragmentManager (WorkEventBus has a 10-minute recentlyFinished 
> cache).
> DRILL-2274 is a reproduction for this case (after the corresponding fix is 
> applied).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
