[jira] [Commented] (DRILL-3167) When a query fails, Foreman should wait for all fragments to finish cleaning up before sending a FAILED state to the client

Deneche A. Hakim (JIRA) Wed, 01 Jul 2015 15:42:20 -0700

    [ 
https://issues.apache.org/jira/browse/DRILL-3167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611114#comment-14611114
 ]


Deneche A. Hakim commented on DRILL-3167:
-----------------------------------------

I noticed one more case that could make the fix not work as expected: 

- If a drillbit dies, QueryManager.DrillbitStatusListener will fail the query 
(after we fix DRILL-3448) if any fragment of that query was running on that 
drillbit. But it won't update the fragment's status, which will most likely 
stay in a non-terminal state.
- The Foreman will then try to cancel the fragments and will most likely fail 
to send the cancellation request. That failure causes 
QueryManager.SignalListener to call fragmentDone() (this is part of my 
change), but because the fragments were already accounted for by 
DrillbitStatusListener, we will get the exception "The finished node count 
exceeds the total node count".

[~jnadeau] Is there a specific reason not to update the state of a dead 
drillbit's fragments to FAILED? That way, even though we may send a 
cancellation request to such a fragment, we shouldn't get an exception.
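
To illustrate the double accounting described above, here is a minimal Java 
sketch. It is not Drill's actual code: FragmentCountSketch, NaiveTracker, 
StateAwareTracker and markFailed() are hypothetical names made up for this 
example, and only the exception message is taken from the comment. The first 
tracker counts completions blindly, so a second report for the same fragment 
overflows the count. The second records a terminal state per fragment, so a 
later report for an already-FAILED fragment is ignored.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch only; these are not Drill's real classes or methods.
class FragmentCountSketch {

  enum State { RUNNING, FAILED }

  /** Counts completions without knowing which fragment finished. */
  static class NaiveTracker {
    private final int total;
    private int finished;

    NaiveTracker(int total) {
      this.total = total;
    }

    void fragmentDone() {
      if (finished + 1 > total) {
        throw new IllegalStateException(
            "The finished node count exceeds the total node count");
      }
      finished++;
    }
  }

  /** Keeps a state per fragment, so a repeated report is a no-op. */
  static class StateAwareTracker {
    private final Map<Integer, State> states = new HashMap<>();
    private int finished;

    StateAwareTracker(int total) {
      for (int id = 0; id < total; id++) {
        states.put(id, State.RUNNING);
      }
    }

    /** Returns true only if this call moved the fragment to FAILED. */
    boolean markFailed(int fragmentId) {
      State current = states.get(fragmentId);
      if (current == null) {
        throw new IllegalArgumentException("Unknown fragment " + fragmentId);
      }
      if (current != State.RUNNING) {
        return false; // already terminal: do not count it again
      }
      states.put(fragmentId, State.FAILED);
      finished++;
      return true;
    }

    int finishedCount() {
      return finished;
    }

    boolean allDone() {
      return finished == states.size();
    }
  }

  public static void main(String[] args) {
    // One fragment, reported twice: once when the drillbit dies,
    // once when the cancellation request cannot be sent.
    NaiveTracker naive = new NaiveTracker(1);
    naive.fragmentDone();
    try {
      naive.fragmentDone();
    } catch (IllegalStateException e) {
      System.out.println("naive: " + e.getMessage());
    }

    StateAwareTracker tracker = new StateAwareTracker(1);
    tracker.markFailed(0); // drillbit death marks the fragment FAILED
    tracker.markFailed(0); // failed cancellation is ignored
    System.out.println("state-aware finished=" + tracker.finishedCount());
  }
}
```

Under these assumptions, the order in which the drillbit-death and 
failed-cancellation paths report a fragment no longer matters, because only 
the first report changes the count.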

> When a query fails, Foreman should wait for all fragments to finish cleaning 
> up before sending a FAILED state to the client
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: DRILL-3167
>                 URL: https://issues.apache.org/jira/browse/DRILL-3167
>             Project: Apache Drill
>          Issue Type: Bug
>            Reporter: Deneche A. Hakim
>            Assignee: Deneche A. Hakim
>             Fix For: 1.2.0
>
>         Attachments: DRILL-3167.1.patch.txt, DRILL-3267.2.patch.txt, 
> DRILL-3267.3.patch.txt, DRILL-3267.4.patch.txt
>
>
> TestDrillbitResilience.foreman_runTryEnd() exposes this problem intermittently.
> The query fails and the Foreman reports the failure to the client, which 
> removes the results listener associated with the failed query. 
> Sometimes a data batch reaches the client after the FAILED state has already 
> arrived; the client doesn't handle this properly, and the corresponding buffer 
> is never released.
> Making the Foreman wait for all fragments to finish before sending the final 
> state should help avoid such scenarios.



