Hi Till,

These files are indeed from the same log; I just extracted one line from the original log and placed it at the beginning. At 9:50 pm, the CC initiated a task cluster and all NCs were executing. Node 2 sent its TaskComplete messages, yet the CC somehow decided to shut node 2 down. That's my understanding.
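For reference, the dead-node check you describe boils down to comparing the age of each node's last heartbeat against the max miss period (80000ms in our config). Below is a minimal Java sketch of that logic, just to confirm my understanding; the names (HeartbeatMonitor, onHeartbeat, sweepDeadNodes) are made up for illustration, not the actual NodeManager code:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Hypothetical sketch of heartbeat-based failure detection; not the
    // actual org.apache.hyracks.control.cc.cluster.NodeManager code.
    public class HeartbeatMonitor {
        // Max time a node may go silent before it is considered dead
        // (80000ms in the Cloudberry logs).
        private final long maxMissPeriodMillis;
        private final Map<String, Long> lastHeartbeatMillis = new HashMap<>();

        public HeartbeatMonitor(long maxMissPeriodMillis) {
            this.maxMissPeriodMillis = maxMissPeriodMillis;
        }

        // Called whenever a heartbeat arrives from a node.
        public synchronized void onHeartbeat(String nodeId) {
            lastHeartbeatMillis.put(nodeId, System.currentTimeMillis());
        }

        // Periodic sweep: any node whose last heartbeat is older than
        // the max miss period is reported dead; the CC then asks it to
        // shut down "to ensure failure" and aborts its jobs.
        public synchronized List<String> sweepDeadNodes() {
            long now = System.currentTimeMillis();
            List<String> dead = new ArrayList<>();
            for (Map.Entry<String, Long> e : lastHeartbeatMillis.entrySet()) {
                long age = now - e.getValue();
                if (age > maxMissPeriodMillis) {
                    // e.g. "2 considered dead. Last heartbeat received 83799ms ago."
                    dead.add(e.getKey());
                }
            }
            return dead;
        }
    }

If that is the mechanism, then all four NCs going silent for ~84s at once (GC pause, swapping, saturated network?) would make a single sweep flag nodes 2, 3, 4, and 5 together, which matches the burst of messages at 21:51:54 in the cc.log.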
Best,
Taewoo

On Sun, Sep 30, 2018 at 12:40 PM, Till Westmann <[email protected]> wrote:

> Hi Taewoo,
>
> I'm a little confused, as we seem to have 2 cc logs.
>
> The "Requesting node 2 to shutdown …" message is logged when nodes are considered dead because they have missed too many heartbeats. And you can see in the later logs that indeed more than 80s passed without receiving a heartbeat.
> Could it be that the machines were too busy to send/receive heartbeats?
>
> Cheers,
> Till
>
> On 30 Sep 2018, at 10:52, Taewoo Kim wrote:
>
>> Hi devs,
>>
>> I saw the following log messages in the cc.log of the Cloudberry cluster. Yesterday around 9:50 pm, a task cluster was created and executed, and around 9:51 pm, the cc.log shows the following message. In the nc.log of node 2, there were no noticeable log messages. I wonder what could cause this situation.
>>
>> -- cc.log at 9:51 pm
>> *21:51:54.203 [Worker:ClusterController] INFO org.apache.hyracks.control.cc.cluster.NodeManager - Requesting node 2 to shutdown to ensure failure*
>>
>> *-- nc-2.log around 9:50 pm*
>> 21:50:30.314 [Worker:2] INFO org.apache.hyracks.control.nc.work.StartTasksWork - input: 0: CDID:2
>> 21:50:30.375 [Worker:2] INFO org.apache.hyracks.control.common.work.WorkQueue - Executing: NotifyTaskCompleteWork:TAID:TID:ANID:ODID:7:0:3:0
>> 21:50:30.382 [Worker:2] INFO org.apache.hyracks.control.common.work.WorkQueue - Executing: NotifyTaskCompleteWork:TAID:TID:ANID:ODID:9:0:3:0
>> 21:50:30.390 [Worker:2] INFO org.apache.hyracks.control.common.work.WorkQueue - Executing: NotifyTaskCompleteWork:TAID:TID:ANID:ODID:9:0:2:0
>> 21:50:30.391 [Worker:2] INFO org.apache.hyracks.control.common.work.WorkQueue - Executing: NotifyTaskCompleteWork:TAID:TID:ANID:ODID:7:0:2:0
>> 21:50:30.392 [Worker:2] INFO org.apache.hyracks.control.common.work.WorkQueue - Executing: NotifyTaskCompleteWork:TAID:TID:ANID:ODID:1:1:3:0
>> 21:50:30.392 [Worker:2] INFO org.apache.hyracks.control.common.work.WorkQueue - Executing: NotifyTaskCompleteWork:TAID:TID:ANID:ODID:1:1:2:0
>> 21:50:30.398 [Worker:2] INFO org.apache.hyracks.control.common.work.WorkQueue - Executing: NotifyTaskCompleteWork:TAID:TID:ANID:ODID:1:1:3:0
>> 21:50:30.398 [Worker:2] INFO org.apache.hyracks.control.common.work.WorkQueue - Executing: NotifyTaskCompleteWork:TAID:TID:ANID:ODID:1:1:2:0
>> 22:02:30.290 [JVM exit thread] INFO org.apache.hyracks.util.ExitUtil - JVM exiting with status 0; bye!
>>
>> -- *cc.log from 9:51 pm*
>>
>> 21:50:30.313 [Worker:ClusterController] INFO org.apache.hyracks.control.cc.executor.ActivityClusterPlanner - Built 1 Task Clusters
>> 21:50:30.313 [Worker:ClusterController] INFO org.apache.hyracks.control.cc.executor.ActivityClusterPlanner - Tasks: [TID:ANID:ODID:1:1:0, TID:ANID:ODID:1:1:1, TID:ANID:ODID:1:1:2, TID:ANID:ODID:1:1:3, TID:ANID:ODID:1:1:4, TID:ANID:ODID:1:1:5, TID:ANID:ODID:1:1:6, TID:ANID:ODID:1:1:7, TID:ANID:ODID:1:1:8, TID:ANID:ODID:1:1:9, TID:ANID:ODID:2:0:0, TID:ANID:ODID:2:0:1, TID:ANID:ODID:2:0:2, TID:ANID:ODID:2:0:3, TID:ANID:ODID:2:0:4, TID:ANID:ODID:2:0:5, TID:ANID:ODID:2:0:6, TID:ANID:ODID:2:0:7, TID:ANID:ODID:2:0:8, TID:ANID:ODID:2:0:9, TID:ANID:ODID:9:0:0, TID:ANID:ODID:9:0:1, TID:ANID:ODID:9:0:2, TID:ANID:ODID:9:0:3, TID:ANID:ODID:9:0:4, TID:ANID:ODID:9:0:5, TID:ANID:ODID:9:0:6, TID:ANID:ODID:9:0:7, TID:ANID:ODID:9:0:8, TID:ANID:ODID:9:0:9]
>> 21:50:30.313 [Worker:ClusterController] INFO org.apache.hyracks.control.cc.executor.JobExecutor - Runnable TC roots: [TC:[TID:ANID:ODID:1:1:0, TID:ANID:ODID:1:1:1, TID:ANID:ODID:1:1:2, TID:ANID:ODID:1:1:3, TID:ANID:ODID:1:1:4, TID:ANID:ODID:1:1:5, TID:ANID:ODID:1:1:6, TID:ANID:ODID:1:1:7, TID:ANID:ODID:1:1:8, TID:ANID:ODID:1:1:9, TID:ANID:ODID:2:0:0, TID:ANID:ODID:2:0:1, TID:ANID:ODID:2:0:2, TID:ANID:ODID:2:0:3, TID:ANID:ODID:2:0:4, TID:ANID:ODID:2:0:5, TID:ANID:ODID:2:0:6, TID:ANID:ODID:2:0:7, TID:ANID:ODID:2:0:8, TID:ANID:ODID:2:0:9, TID:ANID:ODID:9:0:0, TID:ANID:ODID:9:0:1, TID:ANID:ODID:9:0:2, TID:ANID:ODID:9:0:3, TID:ANID:ODID:9:0:4, TID:ANID:ODID:9:0:5, TID:ANID:ODID:9:0:6, TID:ANID:ODID:9:0:7, TID:ANID:ODID:9:0:8, TID:ANID:ODID:9:0:9]], inProgressTaskClusters: []
>> 21:50:30.366 [Worker:ClusterController] INFO org.apache.hyracks.control.common.work.WorkQueue - Executing: TaskComplete: [3[JID:0.700:TAID:TID:ANID:ODID:7:0:5:0]
>> 21:50:30.374 [Worker:ClusterController] INFO org.apache.hyracks.control.common.work.WorkQueue - Executing: TaskComplete: [5[JID:0.700:TAID:TID:ANID:ODID:7:0:9:0]
>> 21:50:30.376 [Worker:ClusterController] INFO org.apache.hyracks.control.common.work.WorkQueue - Executing: TaskComplete: [2[JID:0.700:TAID:TID:ANID:ODID:7:0:3:0]
>> 21:50:30.376 [Worker:ClusterController] INFO org.apache.hyracks.control.common.work.WorkQueue - Executing: TaskComplete: [5[JID:0.700:TAID:TID:ANID:ODID:7:0:8:0]
>> 21:50:30.377 [Worker:ClusterController] INFO org.apache.hyracks.control.common.work.WorkQueue - Executing: TaskComplete: [1[JID:0.700:TAID:TID:ANID:ODID:7:0:0:0]
>> 21:50:30.378 [Worker:ClusterController] INFO org.apache.hyracks.control.common.work.WorkQueue - Executing: TaskComplete: [3[JID:0.700:TAID:TID:ANID:ODID:7:0:4:0]
>> 21:50:30.381 [Worker:ClusterController] INFO org.apache.hyracks.control.common.work.WorkQueue - Executing: TaskComplete: [1[JID:0.700:TAID:TID:ANID:ODID:7:0:1:0]
>> 21:50:30.382 [Worker:ClusterController] INFO org.apache.hyracks.control.common.work.WorkQueue - Executing: TaskComplete: [5[JID:0.701:TAID:TID:ANID:ODID:9:0:8:0]
>> 21:50:30.383 [Worker:ClusterController] INFO org.apache.hyracks.control.common.work.WorkQueue - Executing: TaskComplete: [2[JID:0.701:TAID:TID:ANID:ODID:9:0:3:0]
>> 21:50:30.384 [Worker:ClusterController] INFO org.apache.hyracks.control.common.work.WorkQueue - Executing: TaskComplete: [3[JID:0.701:TAID:TID:ANID:ODID:9:0:4:0]
>> 21:50:30.386 [Worker:ClusterController] INFO org.apache.hyracks.control.common.work.WorkQueue - Executing: TaskComplete: [3[JID:0.701:TAID:TID:ANID:ODID:9:0:5:0]
>> 21:50:30.386 [Worker:ClusterController] INFO org.apache.hyracks.control.common.work.WorkQueue - Executing: TaskComplete: [5[JID:0.701:TAID:TID:ANID:ODID:9:0:9:0]
>> 21:50:30.389 [Worker:ClusterController] INFO org.apache.hyracks.control.common.work.WorkQueue - Executing: TaskComplete: [4[JID:0.701:TAID:TID:ANID:ODID:9:0:6:0]
>> 21:50:30.389 [Worker:ClusterController] INFO org.apache.hyracks.control.common.work.WorkQueue - Executing: TaskComplete: [1[JID:0.701:TAID:TID:ANID:ODID:9:0:0:0]
>> 21:50:30.392 [Worker:ClusterController] INFO org.apache.hyracks.control.common.work.WorkQueue - Executing: TaskComplete: [2[JID:0.701:TAID:TID:ANID:ODID:9:0:2:0]
>> 21:50:30.392 [Worker:ClusterController] INFO org.apache.hyracks.control.common.work.WorkQueue - Executing: TaskComplete: [4[JID:0.700:TAID:TID:ANID:ODID:7:0:7:0]
>> 21:50:30.392 [Worker:ClusterController] INFO org.apache.hyracks.control.common.work.WorkQueue - Executing: TaskComplete: [4[JID:0.700:TAID:TID:ANID:ODID:7:0:6:0]
>> 21:50:30.394 [Worker:ClusterController] INFO org.apache.hyracks.control.common.work.WorkQueue - Executing: TaskComplete: [1[JID:0.701:TAID:TID:ANID:ODID:9:0:1:0]
>> 21:50:30.395 [Worker:ClusterController] INFO org.apache.hyracks.control.common.work.WorkQueue - Executing: TaskComplete: [2[JID:0.700:TAID:TID:ANID:ODID:7:0:2:0]
>> 21:50:30.395 [Worker:ClusterController] INFO org.apache.hyracks.control.common.work.WorkQueue - Executing: TaskComplete: [4[JID:0.701:TAID:TID:ANID:ODID:9:0:7:0]
>> 21:50:30.396 [Worker:ClusterController] INFO org.apache.hyracks.control.common.work.WorkQueue - Executing: TaskComplete: [1[JID:0.700:TAID:TID:ANID:ODID:1:1:1:0]
>> 21:50:30.396 [Worker:ClusterController] INFO org.apache.hyracks.control.common.work.WorkQueue - Executing: TaskComplete: [1[JID:0.701:TAID:TID:ANID:ODID:1:1:0:0]
>> 21:50:30.396 [Worker:ClusterController] INFO org.apache.hyracks.control.common.work.WorkQueue - Executing: TaskComplete: [1[JID:0.700:TAID:TID:ANID:ODID:1:1:0:0]
>> 21:50:30.396 [Worker:ClusterController] INFO org.apache.hyracks.control.common.work.WorkQueue - Executing: TaskComplete: [1[JID:0.701:TAID:TID:ANID:ODID:1:1:1:0]
>> 21:50:30.397 [Worker:ClusterController] INFO org.apache.hyracks.control.common.work.WorkQueue - Executing: TaskComplete: [5[JID:0.700:TAID:TID:ANID:ODID:1:1:8:0]
>> 21:50:30.398 [Worker:ClusterController] INFO org.apache.hyracks.control.common.work.WorkQueue - Executing: TaskComplete: [3[JID:0.700:TAID:TID:ANID:ODID:1:1:4:0]
>> 21:50:30.398 [Worker:ClusterController] INFO org.apache.hyracks.control.common.work.WorkQueue - Executing: TaskComplete: [3[JID:0.700:TAID:TID:ANID:ODID:1:1:5:0]
>> 21:50:30.398 [Worker:ClusterController] INFO org.apache.hyracks.control.common.work.WorkQueue - Executing: TaskComplete: [3[JID:0.701:TAID:TID:ANID:ODID:1:1:4:0]
>> 21:50:30.398 [Worker:ClusterController] INFO org.apache.hyracks.control.common.work.WorkQueue - Executing: TaskComplete: [3[JID:0.701:TAID:TID:ANID:ODID:1:1:5:0]
>> 21:50:30.399 [Worker:ClusterController] INFO org.apache.hyracks.control.common.work.WorkQueue - Executing: TaskComplete: [4[JID:0.700:TAID:TID:ANID:ODID:1:1:6:0]
>> 21:50:30.399 [Worker:ClusterController] INFO org.apache.hyracks.control.common.work.WorkQueue - Executing: TaskComplete: [2[JID:0.700:TAID:TID:ANID:ODID:1:1:3:0]
>> 21:50:30.399 [Worker:ClusterController] INFO org.apache.hyracks.control.common.work.WorkQueue - Executing: TaskComplete: [2[JID:0.700:TAID:TID:ANID:ODID:1:1:2:0]
>> 21:50:30.399 [Worker:ClusterController] INFO org.apache.hyracks.control.common.work.WorkQueue - Executing: TaskComplete: [5[JID:0.700:TAID:TID:ANID:ODID:1:1:9:0]
>> 21:50:30.399 [Worker:ClusterController] INFO org.apache.hyracks.control.common.work.WorkQueue - Executing: TaskComplete: [5[JID:0.701:TAID:TID:ANID:ODID:1:1:8:0]
>> 21:50:30.399 [Worker:ClusterController] INFO org.apache.hyracks.control.common.work.WorkQueue - Executing: TaskComplete: [5[JID:0.701:TAID:TID:ANID:ODID:1:1:9:0]
>> 21:50:30.402 [Worker:ClusterController] INFO org.apache.hyracks.control.common.work.WorkQueue - Executing: TaskComplete: [4[JID:0.700:TAID:TID:ANID:ODID:1:1:7:0]
>> 21:50:30.402 [Worker:ClusterController] INFO org.apache.hyracks.control.common.work.WorkQueue - Executing: TaskComplete: [4[JID:0.701:TAID:TID:ANID:ODID:1:1:6:0]
>> 21:50:30.402 [Worker:ClusterController] INFO org.apache.hyracks.control.common.work.WorkQueue - Executing: TaskComplete: [4[JID:0.701:TAID:TID:ANID:ODID:1:1:7:0]
>> 21:50:30.402 [Worker:ClusterController] INFO org.apache.hyracks.control.common.work.WorkQueue - Executing: TaskComplete: [3[JID:0.700:TAID:TID:ANID:ODID:2:0:5:0]
>> 21:50:30.402 [Worker:ClusterController] INFO org.apache.hyracks.control.common.work.WorkQueue - Executing: TaskComplete: [3[JID:0.700:TAID:TID:ANID:ODID:2:0:4:0]
>> 21:50:30.402 [Worker:ClusterController] INFO org.apache.hyracks.control.common.work.WorkQueue - Executing: TaskComplete: [3[JID:0.700:TAID:TID:ANID:ODID:2:0:5:0]
>> 21:50:30.402 [Worker:ClusterController] INFO org.apache.hyracks.control.common.work.WorkQueue - Executing: TaskComplete: [3[JID:0.700:TAID:TID:ANID:ODID:2:0:4:0]
>> 21:50:30.404 [Worker:ClusterController] INFO org.apache.hyracks.control.common.work.WorkQueue - Executing: TaskComplete: [2[JID:0.701:TAID:TID:ANID:ODID:1:1:3:0]
>> 21:50:30.404 [Worker:ClusterController] INFO org.apache.hyracks.control.common.work.WorkQueue - Executing: TaskComplete: [2[JID:0.701:TAID:TID:ANID:ODID:1:1:2:0]
>> *21:51:54.203 [Worker:ClusterController] INFO org.apache.hyracks.control.cc.cluster.NodeManager - Requesting node 2 to shutdown to ensure failure*
>>
>> 21:51:54.206 [Worker:ClusterController] INFO org.apache.hyracks.control.cc.cluster.NodeManager - Request to shutdown failed node 2 succeeded. false positive heartbeat miss indication
>> 21:51:54.206 [Worker:ClusterController] INFO org.apache.hyracks.control.cc.cluster.NodeManager - 2 considered dead. Last heartbeat received 83799ms ago. Max miss period: 80000ms
>> 21:51:54.206 [Worker:ClusterController] INFO org.apache.hyracks.control.cc.cluster.NodeManager - Requesting node 5 to shutdown to ensure failure
>> 21:51:54.206 [Worker:ClusterController] INFO org.apache.hyracks.control.cc.cluster.NodeManager - Request to shutdown failed node 5 succeeded. false positive heartbeat miss indication
>> 21:51:54.206 [Worker:ClusterController] INFO org.apache.hyracks.control.cc.cluster.NodeManager - 5 considered dead. Last heartbeat received 83806ms ago. Max miss period: 80000ms
>> 21:51:54.207 [Worker:ClusterController] INFO org.apache.hyracks.control.cc.cluster.NodeManager - Requesting node 3 to shutdown to ensure failure
>> 21:51:54.207 [Worker:ClusterController] INFO org.apache.hyracks.control.cc.cluster.NodeManager - Request to shutdown failed node 3 succeeded. false positive heartbeat miss indication
>> 21:51:54.207 [Worker:ClusterController] INFO org.apache.hyracks.control.cc.cluster.NodeManager - 3 considered dead. Last heartbeat received 83804ms ago. Max miss period: 80000ms
>> 21:51:54.207 [Worker:ClusterController] INFO org.apache.hyracks.control.cc.cluster.NodeManager - Requesting node 4 to shutdown to ensure failure
>> 21:51:54.207 [Worker:ClusterController] INFO org.apache.hyracks.control.cc.cluster.NodeManager - Request to shutdown failed node 4 succeeded. false positive heartbeat miss indication
>> 21:51:54.207 [Worker:ClusterController] INFO org.apache.hyracks.control.cc.cluster.NodeManager - 4 considered dead. Last heartbeat received 83804ms ago. Max miss period: 80000ms
>> 21:51:54.207 [Worker:ClusterController] INFO org.apache.hyracks.control.cc.work.RemoveDeadNodesWork - Number of affected jobs: 3
>> 21:51:54.216 [Worker:ClusterController] INFO org.apache.hyracks.control.cc.executor.JobExecutor - Abort map for job: JID:0.1: {1=[TAID:TID:ANID:ODID:0:0:0:0, TAID:TID:ANID:ODID:3:0:0:0, TAID:TID:ANID:ODID:3:0:1:0]}
>> 21:51:54.216 [Worker:ClusterController] INFO org.apache.hyracks.control.cc.executor.JobExecutor - Aborting: [TAID:TID:ANID:ODID:0:0:0:0, TAID:TID:ANID:ODID:3:0:0:0, TAID:TID:ANID:ODID:3:0:1:0] at 1
>> 21:51:54.217 [Worker:ClusterController] INFO org.apache.hyracks.control.cc.partitions.PartitionMatchMaker - Removing uncommitted partitions: []
>> 21:51:54.217 [Worker:ClusterController] INFO org.apache.hyracks.control.cc.partitions.PartitionMatchMaker - Removing partition requests: []
>> 21:51:54.218 [Worker:ClusterController] INFO org.apache.hyracks.control.cc.executor.JobExecutor - aborting doomed task clusters
>> 21:51:54.218 [Worker:ClusterController] INFO org.apache.hyracks.control.cc.executor.JobExecutor - number of doomed task clusters found = 0
>> 21:51:54.218 [Worker:ClusterController] INFO org.apache.hyracks.control.cc.executor.JobExecutor - Runnable TC roots: [TC:[TID:ANID:ODID:0:0:0, TID:ANID:ODID:3:0:0, TID:ANID:ODID:3:0:1, TID:ANID:ODID:3:0:2, TID:ANID:ODID:3:0:3, TID:ANID:ODID:3:0:4, TID:ANID:ODID:3:0:5, TID:ANID:ODID:3:0:6, TID:ANID:ODID:3:0:7, TID:ANID:ODID:3:0:8, TID:ANID:ODID:3:0:9]], inProgressTaskClusters: []
>> 21:51:54.218 [Worker:ClusterController] ERROR org.apache.hyracks.control.cc.executor.JobExecutor - Unexpected failure. Aborting job JID:0.1
>> org.apache.hyracks.api.exceptions.HyracksException: Node 2 not live
>> at org.apache.hyracks.control.cc.executor.JobExecutor.assignLocation(JobExecutor.java:473) ~[hyracks-control-cc-0.3.4-SNAPSHOT.jar:0.3.4-SNAPSHOT]
>> at org.apache.hyracks.control.cc.executor.JobExecutor.assignTaskLocations(JobExecutor.java:365) ~[hyracks-control-cc-0.3.4-SNAPSHOT.jar:0.3.4-SNAPSHOT]
>> at org.apache.hyracks.control.cc.executor.JobExecutor.startRunnableTaskClusters(JobExecutor.java:245) ~[hyracks-control-cc-0.3.4-SNAPSHOT.jar:0.3.4-SNAPSHOT]
>> at org.apache.hyracks.control.cc.executor.JobExecutor.startRunnableActivityClusters(JobExecutor.java:209) ~[hyracks-control-cc-0.3.4-SNAPSHOT.jar:0.3.4-SNAPSHOT]
>> at org.apache.hyracks.control.cc.executor.JobExecutor.notifyNodeFailures(JobExecutor.java:731) [hyracks-control-cc-0.3.4-SNAPSHOT.jar:0.3.4-SNAPSHOT]
>> at org.apache.hyracks.control.cc.work.RemoveDeadNodesWork.run(RemoveDeadNodesWork.java:60) [hyracks-control-cc-0.3.4-SNAPSHOT.jar:0.3.4-SNAPSHOT]
>> at org.apache.hyracks.control.common.work.WorkQueue$WorkerThread.run(WorkQueue.java:127) [hyracks-control-common-0.3.4-SNAPSHOT.jar:0.3.4-SNAPSHOT]
>> 21:51:54.218 [Worker:ClusterController] INFO org.apache.hyracks.control.cc.executor.JobExecutor - Abort map for job: JID:0.700: {}
>> 21:51:54.218 [Worker:ClusterController] INFO org.apache.hyracks.control.cc.partitions.PartitionMatchMaker - Removing uncommitted partitions: []
>> 21:51:54.218 [Worker:ClusterController] INFO org.apache.hyracks.control.cc.partitions.PartitionMatchMaker - Removing partition requests: []
>> 21:51:54.218 [Worker:ClusterController] INFO org.apache.hyracks.control.cc.executor.JobExecutor - aborting doomed task clusters
>> 21:51:54.219 [Worker:ClusterController] INFO org.apache.hyracks.control.cc.executor.JobExecutor - number of doomed task clusters found = 0
>> 21:51:54.219 [Worker:ClusterController] INFO org.apache.hyracks.control.cc.executor.JobExecutor - Abort map for job: JID:0.700: {1=[TAID:TID:ANID:ODID:2:0:0:0, TAID:TID:ANID:ODID:2:0:1:0]}
>>
>> Best,
>> Taewoo
>
