Paul Guo created HAWQ-1326:
------------------------------
Summary: Cancel the query if one of the segments for the query
crashes
Key: HAWQ-1326
URL: https://issues.apache.org/jira/browse/HAWQ-1326
Project: Apache HAWQ
Issue Type: Bug
Reporter: Paul Guo
Assignee: Ed Espino
Fix For: 2.2.0.0-incubating
The QD thread can hang in its poll() loop for two reasons: 1) The surviving
segments can wait at the interconnect for the dead segment until the
interconnect timeout expires (1 hour by default). 2) In the QD thread, poll()
does not detect that a host is down until the kernel's TCP keepalive probing
kicks in; however, the keepalive timeout is long (2 hours by default on
RHEL 6.x) and can be configured only via procfs.
A proper solution would be to use the RM heartbeat mechanism:
RM maintains a global list of node IDs (which stay stable as nodes are added
or removed) and keeps each node's health state up to date via a userspace
heartbeat. We could therefore maintain a bitmap in shared memory that holds
the latest node health information and consult it in the QD code, i.e. cancel
the query if a segment node that handles part of the query is found to be
down.
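
The bitmap idea above could look roughly like the following. This is only a
minimal sketch, not HAWQ code: the names NodeHealthMap, node_health_set,
node_health_is_up and first_down_node, and the MAX_NODES bound, are all
hypothetical, and the sketch leaves out the shared-memory allocation and the
locking or atomics that a bitmap updated by RM and read by QD would need.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_NODES 1024  /* hypothetical upper bound on global node IDs */

/* One bit per global node ID; a set bit means the node is considered healthy.
 * In a real system this struct would live in shared memory (assumption). */
typedef struct NodeHealthMap {
    uint8_t bits[MAX_NODES / 8];
} NodeHealthMap;

/* Start with every node marked healthy. */
static void node_health_init(NodeHealthMap *map)
{
    memset(map->bits, 0xFF, sizeof(map->bits));
}

/* Would be called by the heartbeat side when a node's state changes. */
static void node_health_set(NodeHealthMap *map, int node_id, bool healthy)
{
    assert(node_id >= 0 && node_id < MAX_NODES);
    uint8_t mask = (uint8_t)(1u << (node_id % 8));
    if (healthy)
        map->bits[node_id / 8] |= mask;
    else
        map->bits[node_id / 8] &= (uint8_t)~mask;
}

/* Read side: is the given node currently marked healthy? */
static bool node_health_is_up(const NodeHealthMap *map, int node_id)
{
    assert(node_id >= 0 && node_id < MAX_NODES);
    return ((map->bits[node_id / 8] >> (node_id % 8)) & 1u) != 0;
}

/* Returns the ID of the first node in query_nodes that is marked down,
 * or -1 if every node taking part in the query is marked healthy. */
static int first_down_node(const NodeHealthMap *map,
                           const int *query_nodes, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (!node_health_is_up(map, query_nodes[i]))
            return query_nodes[i];
    }
    return -1;
}
```

Under these assumptions, the QD thread could call poll() with a finite timeout
and check first_down_node() each time it wakes up, cancelling the query when
the result is not -1 instead of waiting for the interconnect or TCP keepalive
timeout.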