[ 
https://issues.apache.org/jira/browse/ASTERIXDB-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16217784#comment-16217784
 ] 

ASF subversion and git services commented on ASTERIXDB-1076:
------------------------------------------------------------

Commit 8734988fae1d32dce74ec9fff5cee0dd5c2e5d15 in asterixdb's branch 
refs/heads/master from [~mblow]
[ https://git-wip-us.apache.org/repos/asf?p=asterixdb.git;h=8734988 ]

[ASTERIXDB-1076][HYR] Prevent node death false positives

- Measure the actual time elapsed since the last heartbeat touch, rather
  than counting dead-cycle detections since the last heartbeat was received
- Update the heartbeat touch when a job result is received, in addition to
  when heartbeat data is received
- Minor refactoring in NC/CC config

Change-Id: Idb1abcc2b783b192b88ed988d398fcfe763531e9
Reviewed-on: https://asterix-gerrit.ics.uci.edu/2097
Sonar-Qube: Jenkins <[email protected]>
Tested-by: Jenkins <[email protected]>
Contrib: Jenkins <[email protected]>
Integration-Tests: Jenkins <[email protected]>
Reviewed-by: Ian Maxon <[email protected]>
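
For readers who have not opened the Gerrit change, the approach in the
commit message above can be roughly illustrated as follows. This is a
minimal sketch, not the actual Hyracks code: the class name
NodeLivenessTracker, its methods, and the maxSilenceMillis parameter are
all hypothetical, and treating a never-seen node as dead is an assumption
made only for this example. It shows liveness decided from elapsed time
since the last "touch", with both heartbeats and job results counting as
touches.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

/** Hypothetical illustration only; these are not the real Hyracks classes. */
class NodeLivenessTracker {
    private final long maxSilenceNanos;
    // Monotonic clock in nanoseconds; injectable so the logic can be tested.
    private final LongSupplier nanoClock;
    private final Map<String, Long> lastTouchNanos = new ConcurrentHashMap<>();

    NodeLivenessTracker(long maxSilenceMillis, LongSupplier nanoClock) {
        this.maxSilenceNanos = TimeUnit.MILLISECONDS.toNanos(maxSilenceMillis);
        this.nanoClock = nanoClock;
    }

    NodeLivenessTracker(long maxSilenceMillis) {
        this(maxSilenceMillis, System::nanoTime);
    }

    /** Records any sign of life from the given node. */
    private void touch(String nodeId) {
        lastTouchNanos.put(nodeId, nanoClock.getAsLong());
    }

    /** Heartbeat data arrived from the node. */
    void onHeartbeat(String nodeId) {
        touch(nodeId);
    }

    /** A job result arrived from the node; this also proves it is alive. */
    void onJobResult(String nodeId) {
        touch(nodeId);
    }

    /**
     * A node is considered dead only when the measured time since its last
     * touch exceeds the threshold, no matter how many detection cycles ran.
     */
    boolean isDead(String nodeId) {
        Long last = lastTouchNanos.get(nodeId);
        if (last == null) {
            return true; // assumption for this sketch: never-seen nodes are not alive
        }
        return nanoClock.getAsLong() - last > maxSilenceNanos;
    }
}
```

Because the decision depends on elapsed time rather than on a count of
detection cycles, a detector thread that runs late or in a burst does not
by itself push a node over the threshold, and a node that is too busy to
send heartbeats on time but is still returning job results stays alive.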


> False failures cause denying new queries
> ----------------------------------------
>
>                 Key: ASTERIXDB-1076
>                 URL: https://issues.apache.org/jira/browse/ASTERIXDB-1076
>             Project: Apache AsterixDB
>          Issue Type: Bug
>          Components: HYR - Hyracks
>            Reporter: Yingyi Bu
>            Assignee: Michael Blow
>              Labels: soon
>
> When CPUs in the cluster are saturated by computation, heartbeats from
> slave nodes to the master node may be delayed.  In this case, the master
> node concludes that a node has failed and can no longer add the node back.
> Hence, the entire cluster becomes unusable and an instance restart is needed.
> Two things need to be fixed:
> 1.  (at least) expose AsterixDB configuration parameters to allow users to
> set a larger heartbeat threshold;
> 2.  allow a node to leave and re-join a Hyracks cluster.
> In the long term, we might need to investigate better liveness check
> strategies.
> To reproduce the issue, overload the slave nodes' CPUs.
> The exception "Asterix Cluster Global recovery is not yet complete and The
> system is in ACTIVE state" will then be thrown for subsequent queries.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
