Additional note: @Till: Yes. It happened again for the same hash-join query. As we can see in the bold part of the following CC.log, one node alone was executing for two hours.
Dec 01, 2016 10:41:56 PM org.apache.hyracks.control.cc.scheduler.ActivityClusterPlanner planActivityCluster INFO: Plan for org.apache.hyracks.api.job.ActivityCluster@383ecfdd Dec 01, 2016 10:41:56 PM org.apache.hyracks.control.cc.scheduler.ActivityClusterPlanner planActivityCluster INFO: Built 1 Task Clusters Dec 01, 2016 10:41:56 PM org.apache.hyracks.control.cc.scheduler.ActivityClusterPlanner planActivityCluster INFO: Tasks: [TID:ANID:ODID:1:1:0, TID:ANID:ODID:1:1:1, TID:ANID:ODID:1:1:2, TID:ANID:ODID:1:1:3, TID:ANID:ODID:1:1:4, TID:ANID:ODID:1:1:5, TID:ANID:ODID:1:1:6, TID:ANID:ODID:1:1:7, TID:ANID:ODID:1:1:8, TID:ANID:ODID:1:1:9, TID:ANID:ODID:1:1:10, TID:ANID:ODID:1:1:11, TID:ANID:ODID:1:1:12, TID:ANID:ODID:1:1:13, TID:ANID:ODID:1:1:14, TID:ANID:ODID:1:1:15, TID:ANID:ODID:1:1:16, TID:ANID:ODID:1:1:17, TID:ANID:ODID:4:0:0] Dec 01, 2016 10:43:18 PM org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run INFO: Executing: TaskComplete: [ss1120_nc3[JID:5:TAID:TID:ANID:ODID:1:1:5:0] Dec 01, 2016 10:43:22 PM org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run INFO: Executing: TaskComplete: [ss1120_nc4[JID:5:TAID:TID:ANID:ODID:1:1:7:0] Dec 01, 2016 10:43:23 PM org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run INFO: Executing: TaskComplete: [ss1120_nc9[JID:5:TAID:TID:ANID:ODID:1:1:16:0] Dec 01, 2016 10:43:28 PM org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run INFO: Executing: TaskComplete: [ss1120_nc2[JID:5:TAID:TID:ANID:ODID:1:1:2:0] Dec 01, 2016 10:43:31 PM org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run INFO: Executing: TaskComplete: [ss1120_nc2[JID:5:TAID:TID:ANID:ODID:1:1:3:0] Dec 01, 2016 10:43:34 PM org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run INFO: Executing: TaskComplete: [ss1120_nc5[JID:5:TAID:TID:ANID:ODID:1:1:8:0] Dec 01, 2016 10:43:40 PM org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run INFO: Executing: TaskComplete: [ss1120_nc9[JID:5:TAID:TID:ANID:ODID:1:1:17:0] Dec 01, 2016 10:43:41 PM org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run INFO: Executing: TaskComplete: [ss1120_nc4[JID:5:TAID:TID:ANID:ODID:1:1:6:0] Dec 01, 2016 10:43:49 PM org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run INFO: Executing: TaskComplete: [ss1120_nc7[JID:5:TAID:TID:ANID:ODID:1:1:12:0] Dec 01, 2016 10:43:51 PM org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run INFO: Executing: TaskComplete: [ss1120_nc1[JID:5:TAID:TID:ANID:ODID:1:1:1:0] Dec 01, 2016 10:43:53 PM org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run INFO: Executing: TaskComplete: [ss1120_nc5[JID:5:TAID:TID:ANID:ODID:1:1:9:0] Dec 01, 2016 10:43:58 PM org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run INFO: Executing: TaskComplete: [ss1120_nc7[JID:5:TAID:TID:ANID:ODID:1:1:13:0] Dec 01, 2016 10:44:25 PM org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run INFO: Executing: TaskComplete: [ss1120_nc8[JID:5:TAID:TID:ANID:ODID:1:1:14:0] Dec 01, 2016 10:44:29 PM org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run INFO: Executing: TaskComplete: [ss1120_nc6[JID:5:TAID:TID:ANID:ODID:1:1:11:0] Dec 01, 2016 10:44:51 PM org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run INFO: Executing: TaskComplete: [ss1120_nc8[JID:5:TAID:TID:ANID:ODID:1:1:15:0] Dec 01, 2016 10:45:19 PM org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run INFO: Executing: TaskComplete: [ss1120_nc3[JID:5:TAID:TID:ANID:ODID:1:1:4:0] *Dec 01, 2016 10:46:31 PM org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run* *INFO: Executing: TaskComplete: [ss1120_nc1[JID:5:TAID:TID:ANID:ODID:1:1:0:0]* *Dec 02, 2016 12:30:19 AM org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run* *INFO: Executing: RegisterResultPartitionLocation: JobId@JID:5 ResultSetId@RSID:0 Partition@0 NPartitions@1 [email protected]:45470 <http://[email protected]:45470> OrderedResult@true EmptyResult@false* Dec 02, 2016 12:30:19 AM org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run INFO: Executing: ReportResultPartitionWriteCompletion: JobId@JID:5 ResultSetId@RSID:0 Partition@0 Dec 02, 2016 12:30:19 AM org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run INFO: Executing: TaskComplete: [ss1120_nc6[JID:5:TAID:TID:ANID:ODID:4:0:0:0] Dec 02, 2016 12:30:19 AM org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run INFO: Executing: TaskComplete: [ss1120_nc6[JID:5:TAID:TID:ANID:ODID:1:1:10:0] Dec 02, 2016 12:30:19 AM org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run INFO: Executing: JobCleanup: JobId@JID:5 Status@TERMINATED Dec 02, 2016 12:30:19 AM org.apache.hyracks.control.cc.work.JobCleanupWork run INFO: Cleanup for JobRun with id: JID:5 Dec 02, 2016 12:30:19 AM org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run INFO: Executing: JobletCleanupNotification Dec 02, 2016 12:30:19 AM org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run INFO: Executing: JobletCleanupNotification Dec 02, 2016 12:30:19 AM org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run INFO: Executing: JobletCleanupNotification Dec 02, 2016 12:30:19 AM org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run INFO: Executing: JobletCleanupNotification Dec 02, 2016 12:30:19 AM org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run INFO: Executing: JobletCleanupNotification Dec 02, 2016 12:30:19 AM org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run INFO: Executing: JobletCleanupNotification Dec 02, 2016 12:30:19 AM org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run INFO: Executing: JobletCleanupNotification Dec 02, 2016 12:30:19 AM org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run INFO: Executing: JobletCleanupNotification Dec 02, 2016 12:30:19 AM org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run INFO: Executing: JobletCleanupNotification Dec 02, 2016 12:30:19 AM org.apache.asterix.active.ActiveLifecycleListener notifyJobFinish INFO: NO NEED TO NOTIFY JOB FINISH! Dec 02, 2016 12:30:19 AM org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run INFO: Executing: GetResultPartitionLocations: JobId@JID:5 ResultSetId@RSID:0 Known@null Best, Taewoo On Thu, Dec 1, 2016 at 10:43 PM, Taewoo Kim <[email protected]> wrote: > @Ian: I have a separate CC on one node that doesn't have a NC. Yourkit > might be a good way to find the reason. Thanks. > > @Till: I think so. I am sending the same query now to see what happens > this time. > > Best, > Taewoo > > On Thu, Dec 1, 2016 at 10:41 PM, Till Westmann <[email protected]> wrote: > >> Hi Taewoo, >> >> is this behavior reproducible? >> >> Cheers, >> Till >> >> >> On 1 Dec 2016, at 22:14, Taewoo Kim wrote: >> >> PS: It took 2 more hours to finish the job on one NC. I wonder why this >>> happens. >>> >>> Dec 01, 2016 7:19:35 PM >>> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run >>> INFO: Executing: NotifyTaskComplete >>> Dec 01, 2016 9:11:23 PM >>> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run >>> INFO: Executing: CleanupJoblet >>> Dec 01, 2016 9:11:23 PM >>> org.apache.hyracks.control.nc.work.CleanupJobletWork run >>> INFO: Cleaning up after job: JID:4 >>> Dec 01, 2016 9:11:23 PM org.apache.hyracks.control.nc.Joblet close >>> WARNING: Freeing leaked 54919521 bytes >>> >>> Best, >>> Taewoo >>> >>> On Thu, Dec 1, 2016 at 8:39 PM, Taewoo Kim <[email protected]> wrote: >>> >>> Hi All, >>>> >>>> Have you experienced this case? >>>> >>>> I have 9 NCs and the CPU utilization of one NC shows 100% for 1 hour and >>>> 30 minutes while other NCs have finished their job about 1 hour ago. >>>> Even >>>> the problematic NC shows the following log at the end. So, looks like >>>> it's >>>> done but I'm not sure why this job never finishes. It's a simple hash >>>> join >>>> for 9M records on 9 nodes. >>>> >>>> Dec 01, 2016 7:18:02 PM org.apache.hyracks.control.com >>>> mon.work.WorkQueue$WorkerThread >>>> run >>>> INFO: Executing: NotifyTaskComplete >>>> >>>> Best, >>>> Taewoo >>>> >>>> >
