[
https://issues.apache.org/jira/browse/ASTERIXDB-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yingyi Bu updated ASTERIXDB-1076:
---------------------------------
Description:
When the CPUs in the cluster are saturated by computation, heartbeats from
slave nodes to the master node may be delayed. In this case, the master
node concludes that a node has failed and can no longer add the node back.
Hence, the entire cluster becomes unusable and an instance restart is needed.
Two things need to be fixed:
1. (at least) expose AsterixDB configuration parameters that allow users to
set a larger heartbeat threshold;
2. allow a node to leave and re-join a Hyracks cluster.
In the long term, we might need to investigate better liveness check strategies.
To reproduce the issue, overload the slave nodes' CPUs.
The exception "Asterix Cluster Global recovery is not yet complete and The
system is in ACTIVE state" will then be thrown for subsequent queries.
was:
When the CPUs in the cluster are saturated by computation, heartbeats from
slave nodes to the master node may be delayed. In this case, the master
node concludes that a node has failed and can no longer add the node back.
Hence, the entire cluster becomes unusable and an instance restart is needed.
Two things need to be fixed:
1. (at least) expose AsterixDB configuration parameters that allow users to
set a larger heartbeat threshold;
2. allow a node to leave and re-join a Hyracks cluster.
In the long term, we might need to investigate better liveness check strategies.
To reproduce the issue, overload the slave nodes' CPUs.
Summary: False failures cause denying new queries (was: False failures
triggers denying new queries)
> False failures cause denying new queries
> ----------------------------------------
>
> Key: ASTERIXDB-1076
> URL: https://issues.apache.org/jira/browse/ASTERIXDB-1076
> Project: Apache AsterixDB
> Issue Type: Bug
> Components: AsterixDB
> Reporter: Yingyi Bu
> Priority: Critical
>
> When the CPUs in the cluster are saturated by computation, heartbeats from
> slave nodes to the master node may be delayed. In this case, the master
> node concludes that a node has failed and can no longer add the node back.
> Hence, the entire cluster becomes unusable and an instance restart is needed.
> Two things need to be fixed:
> 1. (at least) expose AsterixDB configuration parameters that allow users to
> set a larger heartbeat threshold;
> 2. allow a node to leave and re-join a Hyracks cluster.
> In the long term, we might need to investigate better liveness check
> strategies.
> To reproduce the issue, overload the slave nodes' CPUs.
> The exception "Asterix Cluster Global recovery is not yet complete and The
> system is in ACTIVE state" will then be thrown for subsequent queries.
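
The two fixes proposed in the description (a configurable heartbeat
threshold, and letting a node leave and re-join) can be illustrated with a
toy model. This is only a sketch in Python, not Hyracks or AsterixDB code:
the class HeartbeatMonitor, the parameter max_silence_secs, and the
injectable clock are hypothetical names chosen for illustration, and the
actual cluster controller may track liveness quite differently.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Set


@dataclass
class HeartbeatMonitor:
    """Toy liveness tracker (hypothetical; not the Hyracks implementation)."""

    # Assumed user-configurable threshold: how long a node may stay silent
    # before the master treats it as failed.
    max_silence_secs: float
    # Injectable clock so the behaviour can be exercised without sleeping.
    clock: Callable[[], float] = time.monotonic
    last_seen: Dict[str, float] = field(default_factory=dict)
    dead: Set[str] = field(default_factory=set)

    def heartbeat(self, node_id: str) -> None:
        """Record a heartbeat; a node previously marked dead re-joins."""
        self.last_seen[node_id] = self.clock()
        self.dead.discard(node_id)

    def sweep(self) -> Set[str]:
        """Mark nodes silent for longer than the threshold as dead."""
        now = self.clock()
        for node_id, seen in self.last_seen.items():
            if now - seen > self.max_silence_secs:
                self.dead.add(node_id)
        return set(self.dead)

    def is_usable(self) -> bool:
        """The cluster accepts queries only while no known node is dead."""
        return bool(self.last_seen) and not self.dead
```

In this sketch, a delayed heartbeat from an overloaded node only makes the
cluster unusable until the next heartbeat arrives, and raising
max_silence_secs makes false failures less likely at the cost of slower
detection of real ones.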