[ 
https://issues.apache.org/jira/browse/DRILL-3845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15088156#comment-15088156
 ] 

Deneche A. Hakim commented on DRILL-3845:
-----------------------------------------

We've seen this issue once again in a different query. An intermediate fragment 
contains a hashjoin. The left side generates lots of data (it's a view 
that contains 2 flatten operators) and takes more than 10 minutes to finish 
sending all its data. The right side is really small and sends everything in 
less than 2 seconds. 
For some reason (maybe a skew caused by our hashing function) some fragments 
receive no data at all on either side, so the hashjoin stops the fragment. 
Because the left side has not sent those fragments any data, it still sends 
them the "last batch" when it is done, 10 minutes later, and the query fails 
because the fragment is no longer even in the recently finished cache.

The proposed fix updates PartitionSender to not send the "last batch" for any 
receiver that sent an early termination request.
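
To illustrate the idea behind the fix, here is a minimal sketch. It is not 
Drill's actual PartitionSender code: the class LastBatchSketch and the methods 
receiverFinished and lastBatchTargets are hypothetical names, and receivers are 
reduced to integer ids. The sender remembers which receivers asked for early 
termination and leaves them out when it picks the targets of the "last batch".

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a partition sender; not Drill's real class.
class LastBatchSketch {
    private final int receiverCount;
    // Ids of receivers that sent an early termination request.
    private final Set<Integer> terminated = new HashSet<>();

    LastBatchSketch(int receiverCount) {
        this.receiverCount = receiverCount;
    }

    // Called when a receiver tells the sender it needs no more data.
    void receiverFinished(int receiverId) {
        terminated.add(receiverId);
    }

    // Receivers that should still get the "last batch", in id order.
    List<Integer> lastBatchTargets() {
        List<Integer> targets = new ArrayList<>();
        for (int id = 0; id < receiverCount; id++) {
            if (!terminated.contains(id)) {
                targets.add(id);
            }
        }
        return targets;
    }

    public static void main(String[] args) {
        LastBatchSketch sender = new LastBatchSketch(4);
        sender.receiverFinished(1);
        sender.receiverFinished(3);
        System.out.println(sender.lastBatchTargets()); // prints [0, 2]
    }
}
```

With this bookkeeping, a receiver that stopped early never gets a late 
message, however long the sender takes to finish.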

> PartitionSender doesn't send last batch for receivers that already terminated
> -----------------------------------------------------------------------------
>
>                 Key: DRILL-3845
>                 URL: https://issues.apache.org/jira/browse/DRILL-3845
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Relational Operators
>            Reporter: Deneche A. Hakim
>            Assignee: Deneche A. Hakim
>             Fix For: 1.5.0
>
>         Attachments: 29c45a5b-e2b9-72d6-89f2-d49ba88e2939.sys.drill
>
>
> Even if a receiver has finished and informed the corresponding partition 
> sender, the sender will still try to send a "last batch" to the receiver when 
> it's done. In most cases this is fine, as those batches will be silently 
> dropped by the receiving DataServer, but if a receiver finished more than 10 
> minutes ago, DataServer will throw an exception because it cannot find the 
> corresponding FragmentManager (WorkEventBus has a 10-minute recentlyFinished 
> cache).
> DRILL-2274 is a reproduction for this case (after the corresponding fix is 
> applied).
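
The receiving-side behavior described in the issue can be sketched as follows. 
This is an assumption-laden illustration, not the actual WorkEventBus or 
DataServer code: the class RecentlyFinishedSketch, its methods, and the string 
results are all hypothetical, and time is passed in explicitly so the example 
is deterministic. A batch for a running fragment is delivered, a batch for a 
fragment that finished within the retention window is dropped silently, and 
anything else is treated as an error.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model of a "recently finished" lookup; not Drill's real code.
class RecentlyFinishedSketch {
    // 10-minute retention window, taken from the issue description.
    static final long RETENTION_MILLIS = 10 * 60 * 1000L;

    private final Set<String> running = new HashSet<>();
    private final Map<String, Long> finishedAt = new HashMap<>();

    void start(String fragmentId) {
        running.add(fragmentId);
    }

    void finish(String fragmentId, long nowMillis) {
        running.remove(fragmentId);
        finishedAt.put(fragmentId, nowMillis);
    }

    // Decides what happens to an incoming batch for the given fragment.
    String onBatch(String fragmentId, long nowMillis) {
        if (running.contains(fragmentId)) {
            return "DELIVER";
        }
        Long finished = finishedAt.get(fragmentId);
        if (finished != null && nowMillis - finished <= RETENTION_MILLIS) {
            return "DROP"; // finished recently: late batch is ignored
        }
        throw new IllegalStateException("no fragment manager for " + fragmentId);
    }

    public static void main(String[] args) {
        RecentlyFinishedSketch bus = new RecentlyFinishedSketch();
        bus.start("f1");
        bus.finish("f1", 0L);
        System.out.println(bus.onBatch("f1", 2_000L)); // prints DROP
        try {
            bus.onBatch("f1", RETENTION_MILLIS + 1);
        } catch (IllegalStateException e) {
            System.out.println("failed: " + e.getMessage());
        }
    }
}
```

In this model the failure only appears when the sender outlives the retention 
window, which matches the 10-minute timing reported above.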



