Hi Taewoo,

I’m little confused as we seem to have 2 cc logs.

The "Requesting node 2 to shutdown …" message is logged, when nodes are considered to be dead as they have missed too many heartbeats. And you can see in the later logs that indeed more than 80s passed without receiving a heartbeat.
Could it be that the machines were too busy to send/receive heartbeats?

Cheers,
Till

On 30 Sep 2018, at 10:52, Taewoo Kim wrote:

Hi devs,

I saw the following log messages in the cc.log of the Cloudberrry cluster.
Yesterday around 9:50 pm, a task cluster was created and executed. And
around 9:51 pm, the cluster shows the following message. In the nc.log of
the node 2, there were no noticeable log messages. I wonder what could
cause this situation.

-- cc.log at 9:51 pm
*21:51:54.203 [Worker:ClusterController] INFO
org.apache.hyracks.control.cc.cluster.NodeManager - Requesting node 2 to
shutdown to ensure failure*

*-- nc-2.log around 9:50 pm*
21:50:30.314 [Worker:2] INFO
 org.apache.hyracks.control.nc.work.StartTasksWork - input: 0: CDID:2
21:50:30.375 [Worker:2] INFO
 org.apache.hyracks.control.common.work.WorkQueue - Executing:
NotifyTaskCompleteWork:TAID:TID:ANID:ODID:7:0:3:0
21:50:30.382 [Worker:2] INFO
 org.apache.hyracks.control.common.work.WorkQueue - Executing:
NotifyTaskCompleteWork:TAID:TID:ANID:ODID:9:0:3:0
21:50:30.390 [Worker:2] INFO
 org.apache.hyracks.control.common.work.WorkQueue - Executing:
NotifyTaskCompleteWork:TAID:TID:ANID:ODID:9:0:2:0
21:50:30.391 [Worker:2] INFO
 org.apache.hyracks.control.common.work.WorkQueue - Executing:
NotifyTaskCompleteWork:TAID:TID:ANID:ODID:7:0:2:0
21:50:30.392 [Worker:2] INFO
 org.apache.hyracks.control.common.work.WorkQueue - Executing:
NotifyTaskCompleteWork:TAID:TID:ANID:ODID:1:1:3:0
21:50:30.392 [Worker:2] INFO
 org.apache.hyracks.control.common.work.WorkQueue - Executing:
NotifyTaskCompleteWork:TAID:TID:ANID:ODID:1:1:2:0
21:50:30.398 [Worker:2] INFO
 org.apache.hyracks.control.common.work.WorkQueue - Executing:
NotifyTaskCompleteWork:TAID:TID:ANID:ODID:1:1:3:0
21:50:30.398 [Worker:2] INFO
 org.apache.hyracks.control.common.work.WorkQueue - Executing:
NotifyTaskCompleteWork:TAID:TID:ANID:ODID:1:1:2:0
22:02:30.290 [JVM exit thread] INFO org.apache.hyracks.util.ExitUtil - JVM
exiting with status 0; bye!


-- *cc.log from 9:51 pm*
21:50:30.313 [Worker:ClusterController] INFO
org.apache.hyracks.control.cc.executor.ActivityClusterPlanner - Built 1
Task Clusters
21:50:30.313 [Worker:ClusterController] INFO
org.apache.hyracks.control.cc.executor.ActivityClusterPlanner - Tasks:
[TID:ANID:ODID:1:1:0, TID:ANID:ODID:1:1:1, TID:ANID:ODID:1:1:2,
TID:ANID:ODID:1:1:3, TID:ANID:ODID:1:1:4, TID:ANID:ODID:1:1:5,
TID:ANID:ODID:1:1:6, TID:ANID:ODID:1:1:7, TID:ANID:ODID:1:1:8,
TID:ANID:ODID:1:1:9, TID:ANID:ODID:2:0:0, TID:ANID:ODID:2:0:1,
TID:ANID:ODID:2:0:2, TID:ANID:ODID:2:0:3, TID:ANID:ODID:2:0:4,
TID:ANID:ODID:2:0:5, TID:ANID:ODID:2:0:6, TID:ANID:ODID:2:0:7,
TID:ANID:ODID:2:0:8, TID:ANID:ODID:2:0:9, TID:ANID:ODID:9:0:0,
TID:ANID:ODID:9:0:1, TID:ANID:ODID:9:0:2, TID:ANID:ODID:9:0:3,
TID:ANID:ODID:9:0:4, TID:ANID:ODID:9:0:5, TID:ANID:ODID:9:0:6,
TID:ANID:ODID:9:0:7, TID:ANID:ODID:9:0:8, TID:ANID:ODID:9:0:9]
21:50:30.313 [Worker:ClusterController] INFO
org.apache.hyracks.control.cc.executor.JobExecutor - Runnable TC roots:
[TC:[TID:ANID:ODID:1:1:0, TID:ANID:ODID:1:1:1, TID:ANID:ODID:1:1:2,
TID:ANID:ODID:1:1:3, TID:ANID:ODID:1:1:4, TID:ANID:ODID:1:1:5,
TID:ANID:ODID:1:1:6, TID:ANID:ODID:1:1:7, TID:ANID:ODID:1:1:8,
TID:ANID:ODID:1:1:9, TID:ANID:ODID:2:0:0, TID:ANID:ODID:2:0:1,
TID:ANID:ODID:2:0:2, TID:ANID:ODID:2:0:3, TID:ANID:ODID:2:0:4,
TID:ANID:ODID:2:0:5, TID:ANID:ODID:2:0:6, TID:ANID:ODID:2:0:7,
TID:ANID:ODID:2:0:8, TID:ANID:ODID:2:0:9, TID:ANID:ODID:9:0:0,
TID:ANID:ODID:9:0:1, TID:ANID:ODID:9:0:2, TID:ANID:ODID:9:0:3,
TID:ANID:ODID:9:0:4, TID:ANID:ODID:9:0:5, TID:ANID:ODID:9:0:6,
TID:ANID:ODID:9:0:7, TID:ANID:ODID:9:0:8, TID:ANID:ODID:9:0:9]],
inProgressTaskClusters: []
21:50:30.366 [Worker:ClusterController] INFO
 org.apache.hyracks.control.common.work.WorkQueue - Executing:
TaskComplete: [3[JID:0.700:TAID:TID:ANID:ODID:7:0:5:0]
21:50:30.374 [Worker:ClusterController] INFO
 org.apache.hyracks.control.common.work.WorkQueue - Executing:
TaskComplete: [5[JID:0.700:TAID:TID:ANID:ODID:7:0:9:0]
21:50:30.376 [Worker:ClusterController] INFO
 org.apache.hyracks.control.common.work.WorkQueue - Executing:
TaskComplete: [2[JID:0.700:TAID:TID:ANID:ODID:7:0:3:0]
21:50:30.376 [Worker:ClusterController] INFO
 org.apache.hyracks.control.common.work.WorkQueue - Executing:
TaskComplete: [5[JID:0.700:TAID:TID:ANID:ODID:7:0:8:0]
21:50:30.377 [Worker:ClusterController] INFO
 org.apache.hyracks.control.common.work.WorkQueue - Executing:
TaskComplete: [1[JID:0.700:TAID:TID:ANID:ODID:7:0:0:0]
21:50:30.378 [Worker:ClusterController] INFO
 org.apache.hyracks.control.common.work.WorkQueue - Executing:
TaskComplete: [3[JID:0.700:TAID:TID:ANID:ODID:7:0:4:0]
21:50:30.381 [Worker:ClusterController] INFO
 org.apache.hyracks.control.common.work.WorkQueue - Executing:
TaskComplete: [1[JID:0.700:TAID:TID:ANID:ODID:7:0:1:0]
21:50:30.382 [Worker:ClusterController] INFO
 org.apache.hyracks.control.common.work.WorkQueue - Executing:
TaskComplete: [5[JID:0.701:TAID:TID:ANID:ODID:9:0:8:0]
21:50:30.383 [Worker:ClusterController] INFO
 org.apache.hyracks.control.common.work.WorkQueue - Executing:
TaskComplete: [2[JID:0.701:TAID:TID:ANID:ODID:9:0:3:0]
21:50:30.384 [Worker:ClusterController] INFO
 org.apache.hyracks.control.common.work.WorkQueue - Executing:
TaskComplete: [3[JID:0.701:TAID:TID:ANID:ODID:9:0:4:0]
21:50:30.386 [Worker:ClusterController] INFO
 org.apache.hyracks.control.common.work.WorkQueue - Executing:
TaskComplete: [3[JID:0.701:TAID:TID:ANID:ODID:9:0:5:0]
21:50:30.386 [Worker:ClusterController] INFO
 org.apache.hyracks.control.common.work.WorkQueue - Executing:
TaskComplete: [5[JID:0.701:TAID:TID:ANID:ODID:9:0:9:0]
21:50:30.389 [Worker:ClusterController] INFO
 org.apache.hyracks.control.common.work.WorkQueue - Executing:
TaskComplete: [4[JID:0.701:TAID:TID:ANID:ODID:9:0:6:0]
21:50:30.389 [Worker:ClusterController] INFO
 org.apache.hyracks.control.common.work.WorkQueue - Executing:
TaskComplete: [1[JID:0.701:TAID:TID:ANID:ODID:9:0:0:0]
21:50:30.392 [Worker:ClusterController] INFO
 org.apache.hyracks.control.common.work.WorkQueue - Executing:
TaskComplete: [2[JID:0.701:TAID:TID:ANID:ODID:9:0:2:0]
21:50:30.392 [Worker:ClusterController] INFO
 org.apache.hyracks.control.common.work.WorkQueue - Executing:
TaskComplete: [4[JID:0.700:TAID:TID:ANID:ODID:7:0:7:0]
21:50:30.392 [Worker:ClusterController] INFO
 org.apache.hyracks.control.common.work.WorkQueue - Executing:
TaskComplete: [4[JID:0.700:TAID:TID:ANID:ODID:7:0:6:0]
21:50:30.394 [Worker:ClusterController] INFO
 org.apache.hyracks.control.common.work.WorkQueue - Executing:
TaskComplete: [1[JID:0.701:TAID:TID:ANID:ODID:9:0:1:0]
21:50:30.395 [Worker:ClusterController] INFO
 org.apache.hyracks.control.common.work.WorkQueue - Executing:
TaskComplete: [2[JID:0.700:TAID:TID:ANID:ODID:7:0:2:0]
21:50:30.395 [Worker:ClusterController] INFO
 org.apache.hyracks.control.common.work.WorkQueue - Executing:
TaskComplete: [4[JID:0.701:TAID:TID:ANID:ODID:9:0:7:0]
21:50:30.396 [Worker:ClusterController] INFO
 org.apache.hyracks.control.common.work.WorkQueue - Executing:
TaskComplete: [1[JID:0.700:TAID:TID:ANID:ODID:1:1:1:0]
21:50:30.396 [Worker:ClusterController] INFO
 org.apache.hyracks.control.common.work.WorkQueue - Executing:
TaskComplete: [1[JID:0.701:TAID:TID:ANID:ODID:1:1:0:0]
21:50:30.396 [Worker:ClusterController] INFO
 org.apache.hyracks.control.common.work.WorkQueue - Executing:
TaskComplete: [1[JID:0.700:TAID:TID:ANID:ODID:1:1:0:0]
21:50:30.396 [Worker:ClusterController] INFO
 org.apache.hyracks.control.common.work.WorkQueue - Executing:
TaskComplete: [1[JID:0.701:TAID:TID:ANID:ODID:1:1:1:0]
21:50:30.397 [Worker:ClusterController] INFO
 org.apache.hyracks.control.common.work.WorkQueue - Executing:
TaskComplete: [5[JID:0.700:TAID:TID:ANID:ODID:1:1:8:0]
21:50:30.398 [Worker:ClusterController] INFO
 org.apache.hyracks.control.common.work.WorkQueue - Executing:
TaskComplete: [3[JID:0.700:TAID:TID:ANID:ODID:1:1:4:0]
21:50:30.398 [Worker:ClusterController] INFO
 org.apache.hyracks.control.common.work.WorkQueue - Executing:
TaskComplete: [3[JID:0.700:TAID:TID:ANID:ODID:1:1:5:0]
21:50:30.398 [Worker:ClusterController] INFO
 org.apache.hyracks.control.common.work.WorkQueue - Executing:
TaskComplete: [3[JID:0.701:TAID:TID:ANID:ODID:1:1:4:0]
21:50:30.398 [Worker:ClusterController] INFO
 org.apache.hyracks.control.common.work.WorkQueue - Executing:
TaskComplete: [3[JID:0.701:TAID:TID:ANID:ODID:1:1:5:0]
21:50:30.399 [Worker:ClusterController] INFO
 org.apache.hyracks.control.common.work.WorkQueue - Executing:
TaskComplete: [4[JID:0.700:TAID:TID:ANID:ODID:1:1:6:0]
21:50:30.399 [Worker:ClusterController] INFO
 org.apache.hyracks.control.common.work.WorkQueue - Executing:
TaskComplete: [2[JID:0.700:TAID:TID:ANID:ODID:1:1:3:0]
21:50:30.399 [Worker:ClusterController] INFO
 org.apache.hyracks.control.common.work.WorkQueue - Executing:
TaskComplete: [2[JID:0.700:TAID:TID:ANID:ODID:1:1:2:0]
21:50:30.399 [Worker:ClusterController] INFO
 org.apache.hyracks.control.common.work.WorkQueue - Executing:
TaskComplete: [5[JID:0.700:TAID:TID:ANID:ODID:1:1:9:0]
21:50:30.399 [Worker:ClusterController] INFO
 org.apache.hyracks.control.common.work.WorkQueue - Executing:
TaskComplete: [5[JID:0.701:TAID:TID:ANID:ODID:1:1:8:0]
21:50:30.399 [Worker:ClusterController] INFO
 org.apache.hyracks.control.common.work.WorkQueue - Executing:
TaskComplete: [5[JID:0.701:TAID:TID:ANID:ODID:1:1:9:0]
21:50:30.402 [Worker:ClusterController] INFO
 org.apache.hyracks.control.common.work.WorkQueue - Executing:
TaskComplete: [4[JID:0.700:TAID:TID:ANID:ODID:1:1:7:0]
21:50:30.402 [Worker:ClusterController] INFO
 org.apache.hyracks.control.common.work.WorkQueue - Executing:
TaskComplete: [4[JID:0.701:TAID:TID:ANID:ODID:1:1:6:0]
21:50:30.402 [Worker:ClusterController] INFO
 org.apache.hyracks.control.common.work.WorkQueue - Executing:
TaskComplete: [4[JID:0.701:TAID:TID:ANID:ODID:1:1:7:0]
21:50:30.402 [Worker:ClusterController] INFO
 org.apache.hyracks.control.common.work.WorkQueue - Executing:
TaskComplete: [3[JID:0.700:TAID:TID:ANID:ODID:2:0:5:0]
21:50:30.402 [Worker:ClusterController] INFO
 org.apache.hyracks.control.common.work.WorkQueue - Executing:
TaskComplete: [3[JID:0.700:TAID:TID:ANID:ODID:2:0:4:0]
21:50:30.402 [Worker:ClusterController] INFO
 org.apache.hyracks.control.common.work.WorkQueue - Executing:
TaskComplete: [3[JID:0.700:TAID:TID:ANID:ODID:2:0:5:0]
21:50:30.402 [Worker:ClusterController] INFO
 org.apache.hyracks.control.common.work.WorkQueue - Executing:
TaskComplete: [3[JID:0.700:TAID:TID:ANID:ODID:2:0:4:0]
21:50:30.404 [Worker:ClusterController] INFO
 org.apache.hyracks.control.common.work.WorkQueue - Executing:
TaskComplete: [2[JID:0.701:TAID:TID:ANID:ODID:1:1:3:0]
21:50:30.404 [Worker:ClusterController] INFO
 org.apache.hyracks.control.common.work.WorkQueue - Executing:
TaskComplete: [2[JID:0.701:TAID:TID:ANID:ODID:1:1:2:0]
*21:51:54.203 [Worker:ClusterController] INFO
org.apache.hyracks.control.cc.cluster.NodeManager - Requesting node 2 to
shutdown to ensure failure*
21:51:54.206 [Worker:ClusterController] INFO
org.apache.hyracks.control.cc.cluster.NodeManager - Request to shutdown
failed node 2 succeeded. false positive heartbeat miss indication
21:51:54.206 [Worker:ClusterController] INFO
org.apache.hyracks.control.cc.cluster.NodeManager - 2 considered dead.
Last heartbeat received 83799ms ago. Max miss period: 80000ms
21:51:54.206 [Worker:ClusterController] INFO
org.apache.hyracks.control.cc.cluster.NodeManager - Requesting node 5 to
shutdown to ensure failure
21:51:54.206 [Worker:ClusterController] INFO
org.apache.hyracks.control.cc.cluster.NodeManager - Request to shutdown
failed node 5 succeeded. false positive heartbeat miss indication
21:51:54.206 [Worker:ClusterController] INFO
org.apache.hyracks.control.cc.cluster.NodeManager - 5 considered dead.
Last heartbeat received 83806ms ago. Max miss period: 80000ms
21:51:54.207 [Worker:ClusterController] INFO
org.apache.hyracks.control.cc.cluster.NodeManager - Requesting node 3 to
shutdown to ensure failure
21:51:54.207 [Worker:ClusterController] INFO
org.apache.hyracks.control.cc.cluster.NodeManager - Request to shutdown
failed node 3 succeeded. false positive heartbeat miss indication
21:51:54.207 [Worker:ClusterController] INFO
org.apache.hyracks.control.cc.cluster.NodeManager - 3 considered dead.
Last heartbeat received 83804ms ago. Max miss period: 80000ms
21:51:54.207 [Worker:ClusterController] INFO
org.apache.hyracks.control.cc.cluster.NodeManager - Requesting node 4 to
shutdown to ensure failure
21:51:54.207 [Worker:ClusterController] INFO
org.apache.hyracks.control.cc.cluster.NodeManager - Request to shutdown
failed node 4 succeeded. false positive heartbeat miss indication
21:51:54.207 [Worker:ClusterController] INFO
org.apache.hyracks.control.cc.cluster.NodeManager - 4 considered dead.
Last heartbeat received 83804ms ago. Max miss period: 80000ms
21:51:54.207 [Worker:ClusterController] INFO
 org.apache.hyracks.control.cc.work.RemoveDeadNodesWork - Number of
affected jobs: 3
21:51:54.216 [Worker:ClusterController] INFO
org.apache.hyracks.control.cc.executor.JobExecutor - Abort map for job:
JID:0.1: {1=[TAID:TID:ANID:ODID:0:0:0:0, TAID:TID:ANID:ODID:3:0:0:0,
TAID:TID:ANID:ODID:3:0:1:0]}
21:51:54.216 [Worker:ClusterController] INFO
 org.apache.hyracks.control.cc.executor.JobExecutor - Aborting:
[TAID:TID:ANID:ODID:0:0:0:0, TAID:TID:ANID:ODID:3:0:0:0,
TAID:TID:ANID:ODID:3:0:1:0] at 1
21:51:54.217 [Worker:ClusterController] INFO
org.apache.hyracks.control.cc.partitions.PartitionMatchMaker - Removing
uncommitted partitions: []
21:51:54.217 [Worker:ClusterController] INFO
org.apache.hyracks.control.cc.partitions.PartitionMatchMaker - Removing
partition requests: []
21:51:54.218 [Worker:ClusterController] INFO
org.apache.hyracks.control.cc.executor.JobExecutor - aborting doomed task
clusters
21:51:54.218 [Worker:ClusterController] INFO
org.apache.hyracks.control.cc.executor.JobExecutor - number of doomed task
clusters found = 0
21:51:54.218 [Worker:ClusterController] INFO
org.apache.hyracks.control.cc.executor.JobExecutor - Runnable TC roots:
[TC:[TID:ANID:ODID:0:0:0, TID:ANID:ODID:3:0:0, TID:ANID:ODID:3:0:1,
TID:ANID:ODID:3:0:2, TID:ANID:ODID:3:0:3, TID:ANID:ODID:3:0:4,
TID:ANID:ODID:3:0:5, TID:ANID:ODID:3:0:6, TID:ANID:ODID:3:0:7,
TID:ANID:ODID:3:0:8, TID:ANID:ODID:3:0:9]], inProgressTaskClusters: []
21:51:54.218 [Worker:ClusterController] ERROR
org.apache.hyracks.control.cc.executor.JobExecutor - Unexpected failure.
Aborting job JID:0.1
org.apache.hyracks.api.exceptions.HyracksException: Node 2 not live
        at
org.apache.hyracks.control.cc.executor.JobExecutor.assignLocation(JobExecutor.java:473)
~[hyracks-control-cc-0.3.4-SNAPSHOT.jar:0.3.4-SNAPSHOT]
        at
org.apache.hyracks.control.cc.executor.JobExecutor.assignTaskLocations(JobExecutor.java:365)
~[hyracks-control-cc-0.3.4-SNAPSHOT.jar:0.3.4-SNAPSHOT]
        at
org.apache.hyracks.control.cc.executor.JobExecutor.startRunnableTaskClusters(JobExecutor.java:245)
~[hyracks-control-cc-0.3.4-SNAPSHOT.jar:0.3.4-SNAPSHOT]
        at
org.apache.hyracks.control.cc.executor.JobExecutor.startRunnableActivityClusters(JobExecutor.java:209)
~[hyracks-control-cc-0.3.4-SNAPSHOT.jar:0.3.4-SNAPSHOT]
        at
org.apache.hyracks.control.cc.executor.JobExecutor.notifyNodeFailures(JobExecutor.java:731)
[hyracks-control-cc-0.3.4-SNAPSHOT.jar:0.3.4-SNAPSHOT]
        at
org.apache.hyracks.control.cc.work.RemoveDeadNodesWork.run(RemoveDeadNodesWork.java:60)
[hyracks-control-cc-0.3.4-SNAPSHOT.jar:0.3.4-SNAPSHOT]
        at
org.apache.hyracks.control.common.work.WorkQueue$WorkerThread.run(WorkQueue.java:127)
[hyracks-control-common-0.3.4-SNAPSHOT.jar:0.3.4-SNAPSHOT]
21:51:54.218 [Worker:ClusterController] INFO
org.apache.hyracks.control.cc.executor.JobExecutor - Abort map for job:
JID:0.700: {}
21:51:54.218 [Worker:ClusterController] INFO
org.apache.hyracks.control.cc.partitions.PartitionMatchMaker - Removing
uncommitted partitions: []
21:51:54.218 [Worker:ClusterController] INFO
org.apache.hyracks.control.cc.partitions.PartitionMatchMaker - Removing
partition requests: []
21:51:54.218 [Worker:ClusterController] INFO
org.apache.hyracks.control.cc.executor.JobExecutor - aborting doomed task
clusters
21:51:54.219 [Worker:ClusterController] INFO
org.apache.hyracks.control.cc.executor.JobExecutor - number of doomed task
clusters found = 0
21:51:54.219 [Worker:ClusterController] INFO
org.apache.hyracks.control.cc.executor.JobExecutor - Abort map for job: JID:0.700: {1=[TAID:TID:ANID:ODID:2:0:0:0, TAID:TID:ANID:ODID:2:0:1:0]}

Best,
Taewoo

Reply via email to