Taewoo,
You can use the diagnostics end point (/admin/diagnostics) to look at all the 
stack traces from a single interface when that happens. This could give an idea 
on what is happening in such case.
Although, from what you described, it could be that we have some skewness 
during query execution? (could be nulls,missing? any special values?). That is 
also worth considering.

Trying to help without enough context :-). Cheers,
Abdullah.

> On Dec 2, 2016, at 10:22 AM, Taewoo Kim <[email protected]> wrote:
> 
> Additional note: @Till: Yes. It happened again for the same hash-join
> query. As we can see in the bold part of the following CC.log, one node
> alone was executing for two hours.
> 
> 
> Dec 01, 2016 10:41:56 PM
> org.apache.hyracks.control.cc.scheduler.ActivityClusterPlanner
> planActivityCluster
> INFO: Plan for org.apache.hyracks.api.job.ActivityCluster@383ecfdd
> Dec 01, 2016 10:41:56 PM
> org.apache.hyracks.control.cc.scheduler.ActivityClusterPlanner
> planActivityCluster
> INFO: Built 1 Task Clusters
> Dec 01, 2016 10:41:56 PM
> org.apache.hyracks.control.cc.scheduler.ActivityClusterPlanner
> planActivityCluster
> INFO: Tasks: [TID:ANID:ODID:1:1:0, TID:ANID:ODID:1:1:1,
> TID:ANID:ODID:1:1:2, TID:ANID:ODID:1:1:3, TID:ANID:ODID:1:1:4,
> TID:ANID:ODID:1:1:5, TID:ANID:ODID:1:1:6, TID:ANID:ODID:1:1:7,
> TID:ANID:ODID:1:1:8, TID:ANID:ODID:1:1:9, TID:ANID:ODID:1:1:10,
> TID:ANID:ODID:1:1:11, TID:ANID:ODID:1:1:12, TID:ANID:ODID:1:1:13,
> TID:ANID:ODID:1:1:14, TID:ANID:ODID:1:1:15, TID:ANID:ODID:1:1:16,
> TID:ANID:ODID:1:1:17, TID:ANID:ODID:4:0:0]
> Dec 01, 2016 10:43:18 PM
> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> INFO: Executing: TaskComplete: [ss1120_nc3[JID:5:TAID:TID:ANID:ODID:1:1:5:0]
> Dec 01, 2016 10:43:22 PM
> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> INFO: Executing: TaskComplete: [ss1120_nc4[JID:5:TAID:TID:ANID:ODID:1:1:7:0]
> Dec 01, 2016 10:43:23 PM
> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> INFO: Executing: TaskComplete:
> [ss1120_nc9[JID:5:TAID:TID:ANID:ODID:1:1:16:0]
> Dec 01, 2016 10:43:28 PM
> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> INFO: Executing: TaskComplete: [ss1120_nc2[JID:5:TAID:TID:ANID:ODID:1:1:2:0]
> Dec 01, 2016 10:43:31 PM
> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> INFO: Executing: TaskComplete: [ss1120_nc2[JID:5:TAID:TID:ANID:ODID:1:1:3:0]
> Dec 01, 2016 10:43:34 PM
> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> INFO: Executing: TaskComplete: [ss1120_nc5[JID:5:TAID:TID:ANID:ODID:1:1:8:0]
> Dec 01, 2016 10:43:40 PM
> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> INFO: Executing: TaskComplete:
> [ss1120_nc9[JID:5:TAID:TID:ANID:ODID:1:1:17:0]
> Dec 01, 2016 10:43:41 PM
> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> INFO: Executing: TaskComplete: [ss1120_nc4[JID:5:TAID:TID:ANID:ODID:1:1:6:0]
> Dec 01, 2016 10:43:49 PM
> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> INFO: Executing: TaskComplete:
> [ss1120_nc7[JID:5:TAID:TID:ANID:ODID:1:1:12:0]
> Dec 01, 2016 10:43:51 PM
> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> INFO: Executing: TaskComplete: [ss1120_nc1[JID:5:TAID:TID:ANID:ODID:1:1:1:0]
> Dec 01, 2016 10:43:53 PM
> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> INFO: Executing: TaskComplete: [ss1120_nc5[JID:5:TAID:TID:ANID:ODID:1:1:9:0]
> Dec 01, 2016 10:43:58 PM
> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> INFO: Executing: TaskComplete:
> [ss1120_nc7[JID:5:TAID:TID:ANID:ODID:1:1:13:0]
> Dec 01, 2016 10:44:25 PM
> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> INFO: Executing: TaskComplete:
> [ss1120_nc8[JID:5:TAID:TID:ANID:ODID:1:1:14:0]
> Dec 01, 2016 10:44:29 PM
> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> INFO: Executing: TaskComplete:
> [ss1120_nc6[JID:5:TAID:TID:ANID:ODID:1:1:11:0]
> Dec 01, 2016 10:44:51 PM
> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> INFO: Executing: TaskComplete:
> [ss1120_nc8[JID:5:TAID:TID:ANID:ODID:1:1:15:0]
> Dec 01, 2016 10:45:19 PM
> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> INFO: Executing: TaskComplete: [ss1120_nc3[JID:5:TAID:TID:ANID:ODID:1:1:4:0]
> *Dec 01, 2016 10:46:31 PM
> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run*
> *INFO: Executing: TaskComplete:
> [ss1120_nc1[JID:5:TAID:TID:ANID:ODID:1:1:0:0]*
> *Dec 02, 2016 12:30:19 AM
> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run*
> *INFO: Executing: RegisterResultPartitionLocation: JobId@JID:5
> ResultSetId@RSID:0 Partition@0 NPartitions@1
> [email protected]:45470
> <http://[email protected]:45470> OrderedResult@true
> EmptyResult@false*
> Dec 02, 2016 12:30:19 AM
> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> INFO: Executing: ReportResultPartitionWriteCompletion: JobId@JID:5
> ResultSetId@RSID:0 Partition@0
> Dec 02, 2016 12:30:19 AM
> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> INFO: Executing: TaskComplete: [ss1120_nc6[JID:5:TAID:TID:ANID:ODID:4:0:0:0]
> Dec 02, 2016 12:30:19 AM
> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> INFO: Executing: TaskComplete:
> [ss1120_nc6[JID:5:TAID:TID:ANID:ODID:1:1:10:0]
> Dec 02, 2016 12:30:19 AM
> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> INFO: Executing: JobCleanup: JobId@JID:5 Status@TERMINATED
> Dec 02, 2016 12:30:19 AM org.apache.hyracks.control.cc.work.JobCleanupWork
> run
> INFO: Cleanup for JobRun with id: JID:5
> Dec 02, 2016 12:30:19 AM
> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> INFO: Executing: JobletCleanupNotification
> Dec 02, 2016 12:30:19 AM
> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> INFO: Executing: JobletCleanupNotification
> Dec 02, 2016 12:30:19 AM
> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> INFO: Executing: JobletCleanupNotification
> Dec 02, 2016 12:30:19 AM
> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> INFO: Executing: JobletCleanupNotification
> Dec 02, 2016 12:30:19 AM
> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> INFO: Executing: JobletCleanupNotification
> Dec 02, 2016 12:30:19 AM
> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> INFO: Executing: JobletCleanupNotification
> Dec 02, 2016 12:30:19 AM
> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> INFO: Executing: JobletCleanupNotification
> Dec 02, 2016 12:30:19 AM
> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> INFO: Executing: JobletCleanupNotification
> Dec 02, 2016 12:30:19 AM
> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> INFO: Executing: JobletCleanupNotification
> Dec 02, 2016 12:30:19 AM org.apache.asterix.active.ActiveLifecycleListener
> notifyJobFinish
> INFO: NO NEED TO NOTIFY JOB FINISH!
> Dec 02, 2016 12:30:19 AM
> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> INFO: Executing: GetResultPartitionLocations: JobId@JID:5 ResultSetId@RSID:0
> Known@null
> 
> Best,
> Taewoo
> 
> On Thu, Dec 1, 2016 at 10:43 PM, Taewoo Kim <[email protected]> wrote:
> 
>> @Ian: I have a separate CC on one node that doesn't have a NC. Yourkit
>> might be a good way to find the reason. Thanks.
>> 
>> @Till: I think so. I am sending the same query now to see what happens
>> this time.
>> 
>> Best,
>> Taewoo
>> 
>> On Thu, Dec 1, 2016 at 10:41 PM, Till Westmann <[email protected]> wrote:
>> 
>>> Hi Taewoo,
>>> 
>>> is this behavior reproducible?
>>> 
>>> Cheers,
>>> Till
>>> 
>>> 
>>> On 1 Dec 2016, at 22:14, Taewoo Kim wrote:
>>> 
>>> PS: It took 2 more hours to finish the job on one NC. I wonder why this
>>>> happens.
>>>> 
>>>> Dec 01, 2016 7:19:35 PM
>>>> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
>>>> INFO: Executing: NotifyTaskComplete
>>>> Dec 01, 2016 9:11:23 PM
>>>> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
>>>> INFO: Executing: CleanupJoblet
>>>> Dec 01, 2016 9:11:23 PM
>>>> org.apache.hyracks.control.nc.work.CleanupJobletWork run
>>>> INFO: Cleaning up after job: JID:4
>>>> Dec 01, 2016 9:11:23 PM org.apache.hyracks.control.nc.Joblet close
>>>> WARNING: Freeing leaked 54919521 bytes
>>>> 
>>>> Best,
>>>> Taewoo
>>>> 
>>>> On Thu, Dec 1, 2016 at 8:39 PM, Taewoo Kim <[email protected]> wrote:
>>>> 
>>>> Hi All,
>>>>> 
>>>>> Have you experienced this case?
>>>>> 
>>>>> I have 9 NCs and the CPU utilization of one NC shows 100% for 1 hour and
>>>>> 30 minutes while other NCs have finished their job about 1 hour ago.
>>>>> Even
>>>>> the problematic NC shows the following log at the end. So, looks like
>>>>> it's
>>>>> done but I'm not sure why this job never finishes. It's a simple hash
>>>>> join
>>>>> for 9M records on 9 nodes.
>>>>> 
>>>>> Dec 01, 2016 7:18:02 PM org.apache.hyracks.control.com
>>>>> mon.work.WorkQueue$WorkerThread
>>>>> run
>>>>> INFO: Executing: NotifyTaskComplete
>>>>> 
>>>>> Best,
>>>>> Taewoo
>>>>> 
>>>>> 
>> 

Reply via email to