[ 
https://issues.apache.org/jira/browse/HAWQ-1326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Guo reassigned HAWQ-1326:
------------------------------

    Assignee: Paul Guo  (was: Ed Espino)

> Cancel the query if one of the segments for the query crashes
> -------------------------------------------------------------
>
>                 Key: HAWQ-1326
>                 URL: https://issues.apache.org/jira/browse/HAWQ-1326
>             Project: Apache HAWQ
>          Issue Type: Bug
>            Reporter: Paul Guo
>            Assignee: Paul Guo
>             Fix For: 2.2.0.0-incubating
>
>
> The QD thread can hang in its poll() loop for two reasons: 1) The live 
> segments may wait at the interconnect for the dead segment until the 
> interconnect timeout expires (1 hour by default). 2) poll() in the QD thread 
> does not detect that a system is down until kernel TCP keepalive probing is 
> triggered; the keepalive timeout is long (2 hours by default on RHEL 6.x) 
> and can be configured only via procfs.
> A proper solution would be to use the RM heartbeat mechanism:
> RM maintains a global list of node IDs (stable across node additions and 
> removals) and keeps each node's health state up to date via a userspace 
> heartbeat mechanism. We could therefore maintain a bitmap in shared memory 
> that holds the latest node health info and consult it in the QD code, i.e. 
> cancel the query if a segment node that handles part of the query is found 
> to be down.
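
The idea in the description can be illustrated with a rough sketch in C. This is not HAWQ code: NodeHealthMap, node_set_alive(), first_dead_node(), wait_for_segments() and MAX_NODES are hypothetical names, and the bitmap here is an ordinary struct standing in for the one that would live in shared memory and be updated from the RM heartbeat. The point is only that poll() is called with a short timeout, and the bitmap is consulted each time it times out, so a dead segment is noticed without waiting for the interconnect or TCP keepalive timeouts.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define MAX_NODES 1024 /* hypothetical upper bound on RM node IDs */

/* Hypothetical stand-in for the node health bitmap in shared memory:
 * bit N is 1 while the node with global ID N is considered alive. */
typedef struct NodeHealthMap {
    uint8_t bits[MAX_NODES / 8];
} NodeHealthMap;

/* Would be called by the heartbeat side whenever a node changes state. */
void node_set_alive(NodeHealthMap *m, int id, bool alive)
{
    assert(id >= 0 && id < MAX_NODES);
    if (alive)
        m->bits[id / 8] |= (uint8_t)(1u << (id % 8));
    else
        m->bits[id / 8] &= (uint8_t)~(1u << (id % 8));
}

bool node_is_alive(const NodeHealthMap *m, int id)
{
    assert(id >= 0 && id < MAX_NODES);
    return ((m->bits[id / 8] >> (id % 8)) & 1u) != 0;
}

/* Return the ID of the first node in ids[] that is marked down, or -1. */
int first_dead_node(const NodeHealthMap *m, const int *ids, int n)
{
    for (int i = 0; i < n; i++)
        if (!node_is_alive(m, ids[i]))
            return ids[i];
    return -1;
}

enum { WAIT_READY = 0, WAIT_CANCEL = 1, WAIT_ERROR = 2, WAIT_GAVE_UP = 3 };

/* Wait for segment results, but re-check node health after every poll()
 * timeout instead of blocking indefinitely. */
int wait_for_segments(struct pollfd *fds, nfds_t nfds,
                      const NodeHealthMap *m, const int *ids, int n,
                      int poll_timeout_ms, int max_rounds)
{
    for (int round = 0; round < max_rounds; round++) {
        int rc = poll(fds, nfds, poll_timeout_ms);
        if (rc > 0)
            return WAIT_READY;          /* some descriptor has an event */
        if (rc < 0 && errno != EINTR)
            return WAIT_ERROR;          /* unexpected poll() failure */
        if (first_dead_node(m, ids, n) >= 0)
            return WAIT_CANCEL;         /* a participating node is down */
    }
    return WAIT_GAVE_UP;                /* nothing happened in max_rounds */
}
```

In this sketch the caller would cancel the query when wait_for_segments() returns WAIT_CANCEL. A real implementation would also need to decide how readers and the heartbeat writer synchronize access to the shared bitmap, which is left out here.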



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
