[ https://issues.apache.org/jira/browse/HAWQ-1326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Paul Guo reassigned HAWQ-1326:
------------------------------

    Assignee: Paul Guo  (was: Ed Espino)

> Cancel the query if one of the segments for the query crashes
> -------------------------------------------------------------
>
>                 Key: HAWQ-1326
>                 URL: https://issues.apache.org/jira/browse/HAWQ-1326
>             Project: Apache HAWQ
>          Issue Type: Bug
>            Reporter: Paul Guo
>            Assignee: Paul Guo
>             Fix For: 2.2.0.0-incubating
>
>
> The QD thread can hang in its poll() loop because:
> 1) The live segments may wait at the interconnect for the dead segment
> until the interconnect timeout expires (1 hour by default).
> 2) In the QD thread, poll() does not detect that a node is down until
> kernel TCP keepalive probing is triggered; however, the keepalive timeout
> is long (2 hours by default on RHEL 6.x) and it can be configured via
> procfs only.
> A proper solution would be to use the RM heartbeat mechanism:
> RM maintains a global list of node IDs (stable across node additions and
> removals) for all nodes and keeps their health state updated via a
> userspace heartbeat mechanism. We could therefore maintain a bitmap in
> shared memory that holds the latest node health info and use it in the QD
> code, i.e. cancel the query if a segment node that handles part of the
> query is found to be down.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
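
The bitmap idea in the description could be sketched roughly as follows. This is only an illustration, not HAWQ code: the names NodeHealthBitmap, node_health_set, node_health_is_alive, first_dead_node and the MAX_NODES bound are all hypothetical, and the sketch assumes node IDs are small non-negative integers taken from RM's global ID list. A real shared-memory version would also need locking or atomic updates, which are omitted here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical upper bound on node IDs (assumption, not a HAWQ constant). */
#define MAX_NODES 1024

/* Hypothetical health bitmap: bit i set means node i is believed alive.
 * In the proposal this would live in shared memory and be updated by RM
 * from heartbeat results; synchronization is left out of this sketch. */
typedef struct NodeHealthBitmap {
    uint64_t words[MAX_NODES / 64];
} NodeHealthBitmap;

/* Record the latest health state of one node. Out-of-range IDs are ignored. */
void node_health_set(NodeHealthBitmap *bm, int node_id, bool alive)
{
    if (node_id < 0 || node_id >= MAX_NODES)
        return;
    uint64_t mask = (uint64_t)1 << (node_id % 64);
    if (alive)
        bm->words[node_id / 64] |= mask;
    else
        bm->words[node_id / 64] &= ~mask;
}

/* Return true if the node is marked alive. Out-of-range IDs count as down. */
bool node_health_is_alive(const NodeHealthBitmap *bm, int node_id)
{
    if (node_id < 0 || node_id >= MAX_NODES)
        return false;
    return ((bm->words[node_id / 64] >> (node_id % 64)) & 1u) != 0;
}

/* Return the ID of the first node used by the query that is marked down,
 * or -1 if all of them are alive. The QD side could call this between
 * poll() iterations (using a finite poll timeout) and cancel the query
 * when the result is not -1. */
int first_dead_node(const NodeHealthBitmap *bm,
                    const int *query_nodes, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (!node_health_is_alive(bm, query_nodes[i]))
            return query_nodes[i];
    }
    return -1;
}
```

With a check like this, the QD would not have to wait for the interconnect timeout or for kernel TCP keepalive to notice a dead segment node; detection latency would instead be bounded by the heartbeat interval plus the poll timeout chosen for the loop.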