[ 
https://issues.apache.org/jira/browse/ASTERIXDB-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16401211#comment-16401211
 ] 

Murtadha Hubail commented on ASTERIXDB-2185:
--------------------------------------------

[~wangsaeu],

There is no indication in the logs that the NC failed to send the task failure 
notification. As a matter of fact, the logs look normal. There are three tasks 
to be aborted, so it is expected to see these logs repeated with a different 
task id each time. Also, at the end of the logs, the Joblet close is logged, so 
the CC received the task failure notifications and instructed the NCs to do 
the cleanup. Do you have the CC logs that show the cluster going to UNUSABLE? 
All of this could have happened due to an NC failing to send a heartbeat or 
losing its connection with the CC. That would result in the cluster state 
becoming UNUSABLE and the job being aborted.
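
To illustrate the heartbeat scenario above, here is a minimal sketch of how a 
controller could mark a cluster UNUSABLE when a node stops sending heartbeats. 
This is not the Hyracks implementation: HeartbeatMonitor, ClusterState, the 
method names, and the timeout handling are all hypothetical and only meant to 
show the idea.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch, NOT the actual Hyracks/AsterixDB code.
// Tracks the last heartbeat seen from each expected node and reports
// the cluster as UNUSABLE once any node has been silent for too long.
final class HeartbeatMonitor {
    enum ClusterState { ACTIVE, UNUSABLE }

    private final Set<String> expectedNodes;
    private final long timeoutMillis;
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

    HeartbeatMonitor(Collection<String> expectedNodes, long timeoutMillis) {
        this.expectedNodes = new HashSet<>(expectedNodes);
        this.timeoutMillis = timeoutMillis;
    }

    // Called whenever a heartbeat arrives from a node.
    void recordHeartbeat(String nodeId, long nowMillis) {
        lastHeartbeat.put(nodeId, nowMillis);
    }

    // The cluster is ACTIVE only if every expected node has sent a
    // heartbeat within the timeout window; otherwise it is UNUSABLE.
    ClusterState state(long nowMillis) {
        for (String node : expectedNodes) {
            Long last = lastHeartbeat.get(node);
            if (last == null || nowMillis - last > timeoutMillis) {
                return ClusterState.UNUSABLE;
            }
        }
        return ClusterState.ACTIVE;
    }
}
```

In a design like this, the NC-side logs would still look normal (tasks aborted, 
failure notifications sent, Joblet closed), because the state change is decided 
on the controller side. That is why the CC logs are the place to look.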

> Cluster becomes UNUSABLE status after a NC fails to send a job failure.
> -----------------------------------------------------------------------
>
>                 Key: ASTERIXDB-2185
>                 URL: https://issues.apache.org/jira/browse/ASTERIXDB-2185
>             Project: Apache AsterixDB
>          Issue Type: Bug
>          Components: IDX - Indexes, RT - Runtime
>            Reporter: Taewoo Kim
>            Assignee: Murtadha Hubail
>            Priority: Major
>              Labels: triaged
>
> A cluster became UNUSABLE status after a NC failed to send a job failure 
> message. See the exception below.
> {code}
> Dec 03, 2017 6:47:13 PM org.apache.hyracks.control.nc.work.StartTasksWork run
> INFO: Initializing TAID:TID:ANID:ODID:16:0:1:0 -> [Asterix {
>   ets;
>   assign [0, 1, 2] := [Constant, Constant, Constant];
> }, 
> org.apache.hyracks.storage.am.lsm.invertedindex.dataflow.LSMInvertedIndexSearchOperatorDescriptor@23d902c1,
> org.apache.hyracks.dataflow.std.sort.ExternalSortOperatorDescriptor$1@2fc09944]
> Dec 03, 2017 6:47:13 PM 
> org.apache.hyracks.dataflow.std.sort.AbstractSorterOperatorDescriptor$SortActivity$1
>  close
> INFO: InitialNumberOfRuns:0
> Dec 03, 2017 6:47:13 PM 
> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> INFO: Executing: NotifyTaskCompleteWork:TAID:TID:ANID:ODID:13:0:1:0
> Dec 03, 2017 6:47:13 PM 
> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> INFO: Executing: NotifyTaskCompleteWork:TAID:TID:ANID:ODID:13:0:0:0
> Dec 03, 2017 6:47:13 PM 
> org.apache.hyracks.dataflow.std.sort.AbstractSorterOperatorDescriptor$SortActivity$1
>  close
> INFO: InitialNumberOfRuns:0
> Dec 03, 2017 6:47:13 PM 
> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> INFO: Executing: NotifyTaskCompleteWork:TAID:TID:ANID:ODID:16:0:0:0
> Dec 03, 2017 6:47:13 PM 
> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> INFO: Executing: NotifyTaskCompleteWork:TAID:TID:ANID:ODID:16:0:1:0
> Dec 03, 2017 6:48:02 PM 
> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> INFO: Executing: AbortTasks
> Dec 03, 2017 6:48:02 PM org.apache.hyracks.control.nc.work.AbortTasksWork run
> INFO: Aborting Tasks: JID:0:[TAID:TID:ANID:ODID:0:0:0:0, 
> TAID:TID:ANID:ODID:3:0:0:0, TAID:TID:ANID:ODID:3:0:1:0]
> Dec 03, 2017 6:48:02 PM org.apache.hyracks.control.nc.Task run
> WARNING: Task TAID:TID:ANID:ODID:3:0:0:0 failed with exception
> java.lang.InterruptedException
>       at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1302)
>       at java.util.concurrent.Semaphore.acquire(Semaphore.java:467)
>       at org.apache.hyracks.control.nc.Task.run(Task.java:325)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>       at java.lang.Thread.run(Thread.java:744)
> Dec 03, 2017 6:48:02 PM org.apache.hyracks.control.nc.Task run
> WARNING: Task TAID:TID:ANID:ODID:3:0:1:0 failed with exception
> java.lang.InterruptedException
>       at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1302)
>       at java.util.concurrent.Semaphore.acquire(Semaphore.java:467)
>       at org.apache.hyracks.control.nc.Task.run(Task.java:325)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>       at java.lang.Thread.run(Thread.java:744)
> Dec 03, 2017 6:48:02 PM 
> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> INFO: Executing: NotifyTaskFailure
> Dec 03, 2017 6:48:02 PM 
> org.apache.hyracks.control.nc.work.NotifyTaskFailureWork run
> WARNING: 1 is sending a notification to cc that task 
> TAID:TID:ANID:ODID:3:0:0:0 has failed
> org.apache.hyracks.api.exceptions.HyracksDataException: HYR0003: 
> java.lang.InterruptedException
>       at 
> org.apache.hyracks.control.common.utils.ExceptionUtils.setNodeIds(ExceptionUtils.java:68)
>       at org.apache.hyracks.control.nc.Task.run(Task.java:367)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>       at java.lang.Thread.run(Thread.java:744)
> Caused by: java.lang.InterruptedException
>       at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1302)
>       at java.util.concurrent.Semaphore.acquire(Semaphore.java:467)
>       at org.apache.hyracks.control.nc.Task.run(Task.java:325)
>       ... 3 more
>       
>       
> ...... Same exception was repeated for several times ......
> Dec 03, 2017 6:48:02 PM 
> org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> INFO: Executing: NotifyTaskFailure
> Dec 03, 2017 6:48:02 PM 
> org.apache.hyracks.control.nc.work.NotifyTaskFailureWork run
> WARNING: 1 is sending a notification to cc that task 
> TAID:TID:ANID:ODID:3:0:0:0 has failed
> org.apache.hyracks.api.exceptions.HyracksDataException: HYR0003: 
> java.lang.InterruptedException
>       at 
> org.apache.hyracks.control.common.utils.ExceptionUtils.setNodeIds(ExceptionUtils.java:68)
>       at org.apache.hyracks.control.nc.Task.run(Task.java:367)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>       at java.lang.Thread.run(Thread.java:744)
> Caused by: java.lang.InterruptedException
>       at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1302)
>       at java.util.concurrent.Semaphore.acquire(Semaphore.java:467)
>       at org.apache.hyracks.control.nc.Task.run(Task.java:325)
>       ... 3 more
> Dec 03, 2017 6:48:02 PM org.apache.hyracks.control.nc.Joblet close
> WARNING: Freeing leaked 458752 bytes  
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
