[ 
https://issues.apache.org/jira/browse/ASTERIXDB-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14741761#comment-14741761
 ] 

Ian Maxon commented on ASTERIXDB-1076:
--------------------------------------

Oh, it's good that the heartbeats are at least not stuck in the big ol' 
WorkQueue. I was under the impression that was how it was. 

To address 1), the parameters that control the heartbeat interval already exist in 
Hyracks, but they're command-line args to the CC. So it is actually possible to 
change them: you put them in the same place where -Xmx and so on go in 
asterix-configuration.xml (I think, haven't tried... :) ). 
It'd probably be easier/clearer to migrate them to be their own attributes in 
that file; otherwise it's nearly impossible to tell that the option exists in 
the first place. 
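
For what it's worth, here is a rough Java sketch of the kind of liveness check being discussed. This is not the actual Hyracks code: the class name HeartbeatMonitor, its methods, and the two parameters (heartbeatPeriodMs, maxLapsePeriods) are made up for illustration. It assumes a node counts as failed once no heartbeat has arrived for period * allowed lapses, and that a late heartbeat is enough to bring the node back (item 2 of the issue).

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical heartbeat-based liveness check; NOT the Hyracks implementation. */
class HeartbeatMonitor {
    // Assumed parameter: expected interval between two heartbeats of a node.
    private final long heartbeatPeriodMs;
    // Assumed parameter: number of missed periods tolerated before a node is considered failed.
    private final int maxLapsePeriods;
    // Timestamp (ms) of the most recent heartbeat received from each node.
    private final Map<String, Long> lastHeartbeatMs = new HashMap<>();

    HeartbeatMonitor(long heartbeatPeriodMs, int maxLapsePeriods) {
        this.heartbeatPeriodMs = heartbeatPeriodMs;
        this.maxLapsePeriods = maxLapsePeriods;
    }

    /** Records a heartbeat. A node that was considered failed re-joins simply by reporting again. */
    void onHeartbeat(String nodeId, long nowMs) {
        lastHeartbeatMs.put(nodeId, nowMs);
    }

    /** A node is alive if its last heartbeat is no older than period * allowed lapses. */
    boolean isAlive(String nodeId, long nowMs) {
        Long last = lastHeartbeatMs.get(nodeId);
        if (last == null) {
            return false; // never heard from this node
        }
        return nowMs - last <= heartbeatPeriodMs * maxLapsePeriods;
    }
}
```

With, say, a 10 s period and 5 allowed lapses (illustrative values only), a node whose heartbeat is delayed by more than 50 s would be reported as failed. Raising either value widens that window, and the next heartbeat that does get through makes isAlive return true again.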

> False failures cause denying new queries
> ----------------------------------------
>
>                 Key: ASTERIXDB-1076
>                 URL: https://issues.apache.org/jira/browse/ASTERIXDB-1076
>             Project: Apache AsterixDB
>          Issue Type: Bug
>          Components: AsterixDB
>            Reporter: Yingyi Bu
>            Assignee: Yingyi Bu
>            Priority: Critical
>
> When the CPUs in the cluster are saturated by computation, heartbeats from 
> slave nodes to the master node might be delayed.  In this case, the master 
> node thinks a node has failed and can no longer add the node back.  Hence, the 
> entire cluster is unusable and an instance restart is needed.
> Two things need to be fixed:
> 1.  (at least) expose AsterixDB configuration parameters to allow users to 
> set a large heartbeat threshold;
> 2.  allow a node to leave and re-join a Hyracks cluster.
> In the long term, we might need to investigate better liveness check 
> strategies.
> To reproduce the issue, just overload the slave nodes' CPUs and you will 
> see it.
> The exception " Asterix Cluster Global recovery is not yet complete and The 
> system is in ACTIVE state" will be thrown for subsequent queries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
