[ 
https://issues.apache.org/jira/browse/DRILL-3845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14992264#comment-14992264
 ] 

Deneche A. Hakim commented on DRILL-3845:
-----------------------------------------

Sorry for the confusion, I should've closed the pull request.

[~sphillips] that was the point of my comment: the fix needs to go into the 
unordered receiver so that all senders and receivers follow the same protocol.

[~jnadeau] for this particular case, the query starts 3 intermediate fragments 
that contain a hash join. Because the data is heavily skewed, 2 of those 
fragments don't get any data on their right table and finish right away, while 
the 3rd fragment gets all the data and takes nearly 1 hour to finish. When the 
query is finally done, the leaf senders send the "final batch" to the fragments 
that already finished, and the WorkEventBus throws an exception because it 
can't find those fragments anywhere (its cache expires after 10 minutes).
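
To make the timing concrete, here is a minimal sketch of the failure mode. It is not Drill's WorkEventBus code: RecentlyFinishedCache, markFinished and dropLateBatch are hypothetical names, and the clock is passed in explicitly so the 10-minute expiry can be simulated.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a "recently finished" fragment cache; not Drill's actual class. */
final class RecentlyFinishedCache {
    private final long ttlMillis;
    // Fragment id -> time (in ms) at which the fragment reported it had finished.
    private final Map<String, Long> finishedAt = new HashMap<>();

    RecentlyFinishedCache(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    void markFinished(String fragmentId, long nowMillis) {
        finishedAt.put(fragmentId, nowMillis);
    }

    /**
     * Handles a batch that arrives for a fragment that is no longer running.
     * Returns true when the fragment is still remembered, so the batch can be
     * dropped silently. Throws once the entry has expired, which mirrors the
     * "can't find the fragment" exception described above.
     */
    boolean dropLateBatch(String fragmentId, long nowMillis) {
        Long finishedTime = finishedAt.get(fragmentId);
        if (finishedTime != null && nowMillis - finishedTime <= ttlMillis) {
            return true;
        }
        finishedAt.remove(fragmentId); // expired entries are forgotten
        throw new IllegalStateException("Unknown fragment: " + fragmentId);
    }
}
```

With a 10-minute TTL, a batch arriving 5 minutes after the fragment finished is dropped quietly, while one arriving an hour later (as in this query) raises the exception.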

I am running the query one more time and will attach the profile when it fails 
(should take about an hour).

> Partition sender shouldn't send the "last batch" to a receiver that sent a 
> "receiver finished" to the sender
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: DRILL-3845
>                 URL: https://issues.apache.org/jira/browse/DRILL-3845
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Relational Operators
>            Reporter: Deneche A. Hakim
>            Assignee: Deneche A. Hakim
>             Fix For: 1.4.0
>
>
> Even if a receiver has finished and informed the corresponding partition 
> sender, the sender will still try to send a "last batch" to the receiver when 
> it's done. In most cases this is fine, as those batches will be silently 
> dropped by the receiving DataServer, but if a receiver finished more than 10 
> minutes ago, DataServer will throw an exception because it can't find the 
> corresponding FragmentManager (WorkEventBus has a 10-minute recentlyFinished 
> cache).
> DRILL-2274 is a reproduction for this case (after the corresponding fix is 
> applied).
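
The behavior the issue title asks for can be sketched as follows. This is only an illustration under stated assumptions, not the actual Drill patch: PartitionSenderSketch, receiverFinished and lastBatchTargets are hypothetical names, and receivers are identified by a plain integer index.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Hypothetical sender that remembers which receivers have reported "receiver finished". */
final class PartitionSenderSketch {
    private final int receiverCount;
    // Bit i is set once receiver i has told this sender that it is finished.
    private final BitSet finishedReceivers;

    PartitionSenderSketch(int receiverCount) {
        this.receiverCount = receiverCount;
        this.finishedReceivers = new BitSet(receiverCount);
    }

    /** Records a "receiver finished" message from the given receiver. */
    void receiverFinished(int receiverIndex) {
        finishedReceivers.set(receiverIndex);
    }

    /** Returns the receivers that should still get the "last batch": those not yet finished. */
    List<Integer> lastBatchTargets() {
        List<Integer> targets = new ArrayList<>();
        for (int i = 0; i < receiverCount; i++) {
            if (!finishedReceivers.get(i)) {
                targets.add(i);
            }
        }
        return targets;
    }
}
```

In the scenario from this issue, with 3 receivers of which 2 finish early, only the third would be sent the last batch, so nothing reaches a fragment whose cache entry may have already expired.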



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
