I recently ran into a QD timeout issue. It happens when: 1) one of the segment nodes panics and then restarts; 2) the other segments hang in the interconnect (they keep retrying until the timeout expires).
The QD stays in its poll() loop until either a QE reports an error after the interconnect timeout, or libpq (QD<->QE) reports a timeout error, since the socket is configured with kernel TCP keepalive. This is bad because the default timeouts of both detection mechanisms are long (1 hour and 2 hours on my test systems).

Although we could modify the default values, I'm wondering whether we could have a better and more controllable solution: use the RM heartbeat mechanism. The RM maintains a global ID list for all nodes (the IDs stay stable across node additions and removals) and keeps updating their health state via a userspace heartbeat mechanism. We could therefore maintain a bitmap in shared memory that always holds the latest node health info, and use it in the QD code, i.e. cancel the query if a segment node that handles part of the query is found to be down.

Any ideas? Thanks.
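
To make the bitmap idea a bit more concrete, here is a rough sketch in Python (the real thing would be C in the QD, with the bitmap living in shared memory and updated by the RM). Everything named here is hypothetical and only for illustration: `NodeHealthBitmap`, `wait_for_segments`, the 100 ms poll interval, and the plain `bytearray` standing in for the shared memory segment. The point is only to show the QD waking up from poll() on a short timeout and checking the health bits of the nodes that take part in the query.

```python
import select
import socket
import time


class NodeHealthBitmap:
    """One bit per global node ID; a set bit means the node is believed healthy.

    Assumption: in a real system this would sit in shared memory and be
    written by the RM heartbeat code. A bytearray stands in for it here.
    """

    def __init__(self, num_nodes):
        self.bits = bytearray((num_nodes + 7) // 8)

    def set_health(self, node_id, healthy):
        byte, bit = divmod(node_id, 8)
        if healthy:
            self.bits[byte] |= 1 << bit
        else:
            self.bits[byte] &= ~(1 << bit) & 0xFF

    def is_healthy(self, node_id):
        byte, bit = divmod(node_id, 8)
        return bool(self.bits[byte] & (1 << bit))


def wait_for_segments(socks_by_node, health, poll_interval_ms=100, max_wait_s=None):
    """Wait for QE sockets, checking node health each time poll() times out.

    Returns ("ready", node_ids) when data arrived, ("cancel", node_ids) when
    a participating node is marked down, or ("timeout", []) after max_wait_s.
    """
    poller = select.poll()
    fd_to_node = {}
    for node_id, sock in socks_by_node.items():
        poller.register(sock, select.POLLIN)
        fd_to_node[sock.fileno()] = node_id

    deadline = None if max_wait_s is None else time.monotonic() + max_wait_s
    while True:
        events = poller.poll(poll_interval_ms)
        if events:
            return "ready", sorted(fd_to_node[fd] for fd, _ in events)
        # Nothing arrived within the interval: consult the health bitmap,
        # but only for the nodes that handle part of this query.
        down = sorted(n for n in socks_by_node if not health.is_healthy(n))
        if down:
            return "cancel", down
        if deadline is not None and time.monotonic() >= deadline:
            return "timeout", []
```

With something like this, the time to notice a dead segment would be bounded roughly by the RM heartbeat interval plus the poll interval, rather than by the interconnect or TCP keepalive defaults.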
