[
https://issues.apache.org/jira/browse/ASTERIXDB-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14741637#comment-14741637
]
Yingyi Bu commented on ASTERIXDB-1076:
--------------------------------------
WorkQueue maintains all the cluster management event processing threads,
but it doesn't include heartbeat processing. Those management events may
deserve a high priority, though NORM_PRIORITY is probably OK.
Real data processing operators are run in
org.apache.hyracks.control.nc.Task, where we already set their priority to
be Thread.MIN_PRIORITY (line 270).
Heartbeat processing is handled separately, in
org.apache.hyracks.control.nc.NodeControllerService (line 294):
timer.schedule(heartbeatTask, 0, nodeParameters.getHeartbeatPeriod());
I guess we can define our own timer thread, set it to MAX_PRIORITY,
and see if that works.
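
To make the idea concrete, here is a minimal sketch of a dedicated
heartbeat thread running at MAX_PRIORITY. It swaps java.util.Timer for a
ScheduledExecutorService, since that lets us supply our own ThreadFactory.
The class name HeartbeatScheduler, the thread name "heartbeat-sender", and
the start() method are hypothetical and not part of Hyracks. Thread
priorities are only hints to the scheduler, so this would still need to be
verified under load.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration; not an existing Hyracks class.
final class HeartbeatScheduler {
    private HeartbeatScheduler() {
    }

    // Creates daemon threads at MAX_PRIORITY so that heartbeats are favored
    // over Task threads, which run at MIN_PRIORITY.
    static final ThreadFactory HEARTBEAT_THREADS = runnable -> {
        Thread t = new Thread(runnable, "heartbeat-sender");
        t.setDaemon(true);
        t.setPriority(Thread.MAX_PRIORITY);
        return t;
    };

    // Runs heartbeatTask immediately and then once every periodMillis on a
    // single dedicated thread. A TimerTask is also a Runnable, so the
    // existing heartbeat task could be passed in unchanged.
    static ScheduledExecutorService start(Runnable heartbeatTask, long periodMillis) {
        ScheduledExecutorService executor =
                Executors.newSingleThreadScheduledExecutor(HEARTBEAT_THREADS);
        executor.scheduleAtFixedRate(heartbeatTask, 0, periodMillis, TimeUnit.MILLISECONDS);
        return executor;
    }
}
```

With something like this, the timer.schedule(...) call above would become
HeartbeatScheduler.start(heartbeatTask, nodeParameters.getHeartbeatPeriod()).
Note that scheduleAtFixedRate is fixed-rate, whereas Timer.schedule is
fixed-delay, so the timing is slightly different when a run is late.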
Best,
Yingyi
On Fri, Sep 11, 2015 at 3:00 PM, Till Westmann (JIRA) <[email protected]> wrote:
> False failures cause denying new queries
> ----------------------------------------
>
> Key: ASTERIXDB-1076
> URL: https://issues.apache.org/jira/browse/ASTERIXDB-1076
> Project: Apache AsterixDB
> Issue Type: Bug
> Components: AsterixDB
> Reporter: Yingyi Bu
> Priority: Critical
>
> When CPUs in the cluster are saturated by computation, the heartbeat from
> slave nodes to the master node might get delayed. In this case, the master
> node thinks a node has failed and can no longer add the node back. Hence,
> the entire cluster becomes unusable and an instance restart is needed.
> Two things need to be fixed:
> 1. (at least) expose AsterixDB configuration parameters to allow users to
> set a large heartbeat threshold;
> 2. allow a node to leave and re-join a hyracks cluster.
> In the long term, we might need to investigate better liveness check
> strategies.
> To reproduce the issue, just overload the slave nodes' CPUs and you will
> see it.
> The exception " Asterix Cluster Global recovery is not yet complete and The
> system is in ACTIVE state" will be thrown for subsequent queries.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)