WorkQueue maintains all the cluster management event processing threads, but that does not include heartbeat processing. Those management events may deserve a fairly high priority, though NORM_PRIORITY may be OK. The actual data processing operators run in org.apache.hyracks.control.nc.Task, where we already set their priority to Thread.MIN_PRIORITY (line 270).

Heartbeat processing is separated in org.apache.hyracks.control.nc.NodeControllerService (line 294):

timer.schedule(heartbeatTask, 0, nodeParameters.getHeartbeatPeriod());

I guess we can define our own timer thread, set the MAX_PRIORITY for it, and see if it works.

Best,
Yingyi

On Fri, Sep 11, 2015 at 3:00 PM, Till Westmann (JIRA) <[email protected]> wrote:

> [ https://issues.apache.org/jira/browse/ASTERIXDB-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14741602#comment-14741602 ]
>
> Till Westmann commented on ASTERIXDB-1076:
> ------------------------------------------
>
> In org.apache.hyracks.control.common.work.WorkQueue we set the priority of
> every new WorkerThread to MAX_PRIORITY. Maybe it would be sufficient to
> reserve MAX_PRIORITY for Heartbeat threads?
>
> > False failures cause denying new queries
> > ----------------------------------------
> >
> > Key: ASTERIXDB-1076
> > URL: https://issues.apache.org/jira/browse/ASTERIXDB-1076
> > Project: Apache AsterixDB
> > Issue Type: Bug
> > Components: AsterixDB
> > Reporter: Yingyi Bu
> > Priority: Critical
> >
> > When CPUs in the cluster are saturated for computations, the heartbeat
> > from slave nodes to the master node might get delayed. In this case, the
> > master node thinks a node fails, and can no longer adds the node back.
> > Hence, the entire cluster is not usable and an instance restart is needed.
> > Two things need to be fixed:
> > 1. (at least) expose AsterixDB configuration parameters to allow users
> > to set a large heartbeat threshold;
> > 2. allow a node to leave and re-join a hyracks cluster.
> > In the long term, we might need to investigate better liveness check
> > strategies.
> > To reproduce that issue, just let slave nodes' CPUs overloaded and you
> > will see that.
> > The exception " Asterix Cluster Global recovery is not yet complete and
> > The system is in ACTIVE state" will be thrown for upcoming queries.
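
For illustration, here is a rough sketch of the "own timer thread with MAX_PRIORITY" idea from the reply above. It is not the Hyracks code: HeartbeatScheduler, newHeartbeatExecutor, and start are hypothetical names, and it swaps java.util.Timer for a ScheduledExecutorService so that the thread priority can be set through a ThreadFactory. Note that Java thread priorities are only hints to the OS scheduler, so whether this actually keeps heartbeats on time under CPU saturation would still need to be tested.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Hyracks: runs the heartbeat task on a
// dedicated thread whose priority is Thread.MAX_PRIORITY.
final class HeartbeatScheduler {
    private HeartbeatScheduler() {
    }

    // Creates a single-threaded scheduler whose only thread is a daemon
    // thread with MAX_PRIORITY.
    static ScheduledExecutorService newHeartbeatExecutor() {
        ThreadFactory factory = runnable -> {
            Thread t = new Thread(runnable, "heartbeat-timer");
            t.setDaemon(true);
            t.setPriority(Thread.MAX_PRIORITY);
            return t;
        };
        return Executors.newSingleThreadScheduledExecutor(factory);
    }

    // Roughly mirrors timer.schedule(task, 0, period): start immediately,
    // then wait periodMillis after each run finishes (fixed delay).
    // If the task throws, later runs are suppressed, so the task itself
    // should catch and log its own exceptions.
    static ScheduledExecutorService start(Runnable heartbeatTask, long periodMillis) {
        ScheduledExecutorService executor = newHeartbeatExecutor();
        executor.scheduleWithFixedDelay(heartbeatTask, 0, periodMillis, TimeUnit.MILLISECONDS);
        return executor;
    }
}
```

Under these assumptions, the call in NodeControllerService would become something like HeartbeatScheduler.start(heartbeatTask, nodeParameters.getHeartbeatPeriod()), with the returned executor shut down when the node controller stops.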
