[ 
https://issues.apache.org/jira/browse/OOZIE-1051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kanter updated OOZIE-1051:
---------------------------------

    Attachment: OOZIE-1051.patch

The patch makes the suggested change from the description.  I also made a unit 
test to verify that the fix works (it fails without the fix and passes with 
it).  It would be good to test this with an actual deployment, but I'm not sure 
how to get Oozie into this state.  

One other thing:
Before trying to release the lock, it checks if the lock is null; and the only 
place it sets the lock to something is when it acquires it.  So, it seems to be 
that the only way for the command to get into this state where it tries to 
release a lock it didn't acquire would be if it was executed once with 
interrupt mode disabled (where it acquired and released the lock correctly) and 
then is executed again with interrupt mode enabled (where it still has a 
reference to the lock from last time, but didn't actually acquire it this time 
when it tries to release it) -- this is essentially what my unit test does.  
With this in mind, another possible solution is to set the lock back to null 
after releasing it (see below).  We could also do both.  Thoughts?

{code}
private void releaseLock() {
    if (lock != null) {
        lock.release();
        lock = null;   // <--- add this here
        LOG.debug("Released lock for [{0}] in [{1}]", getEntityKey(), 
getName());
    }
}
{code}
                
> Repeating Errors for workflows that were allreday Killed
> --------------------------------------------------------
>
>                 Key: OOZIE-1051
>                 URL: https://issues.apache.org/jira/browse/OOZIE-1051
>             Project: Oozie
>          Issue Type: Bug
>          Components: workflow
>    Affects Versions: 3.2.0
>            Reporter: Boris Shulman
>            Assignee: Robert Kanter
>         Attachments: OOZIE-1051.patch
>
>
>  am experiencing a repeated errors for workflows that were already killed.
> {code}
> 012-11-05 21:01:27,975 DEBUG KillXCommand:545 - USER[isotopeuser] GROUP[-] 
> TOKEN[] APP[MonitoringTestWorkflow1__0] 
> JOB[0000713-121031144215857-oozie-isot-W] ACTION[-] Acquired lock for 
> [0000712-121031144215857-oozie-isot-W] in [kill]
> 2012-11-05 21:01:27,975 DEBUG KillXCommand:545 - USER[isotopeuser] GROUP[-] 
> TOKEN[] APP[MonitoringTestWorkflow1__0] 
> JOB[0000713-121031144215857-oozie-isot-W] ACTION[-] Load state for 
> [0000712-121031144215857-oozie-isot-W]
> 2012-11-05 21:01:28,475 DEBUG KillXCommand:545 - USER[isotopeuser] GROUP[-] 
> TOKEN[] APP[MonitoringTestWorkflow1__0] 
> JOB[0000712-121031144215857-oozie-isot-W] ACTION[-] Precondition check for 
> command [kill] key [0000712-121031144215857-oozie-isot-W]
> 2012-11-05 21:01:28,475 DEBUG KillXCommand:545 - USER[isotopeuser] GROUP[-] 
> TOKEN[] APP[MonitoringTestWorkflow1__0] 
> JOB[0000712-121031144215857-oozie-isot-W] ACTION[-] Released lock for 
> [0000712-121031144215857-oozie-isot-W] in [kill]
> 2012-11-05 21:01:28,475  WARN KillXCommand:542 - USER[isotopeuser] GROUP[-] 
> TOKEN[] APP[MonitoringTestWorkflow1__0] 
> JOB[0000712-121031144215857-oozie-isot-W] ACTION[-] E0725: Workflow instance 
> can not be killed, 0000712-121031144215857-oozie-isot-W, Error Code: E0725
> 2012-11-05 21:01:28,475 DEBUG KillXCommand:545 - USER[isotopeuser] GROUP[-] 
> TOKEN[] APP[MonitoringTestWorkflow1__0] 
> JOB[0000712-121031144215857-oozie-isot-W] ACTION[-] Command [kill] key 
> [0000712-121031144215857-oozie-isot-W]  already executed for 
> [org.apache.oozie.command.wf.KillXCommand@6903ada0]
> 2012-11-05 21:01:28,475 ERROR KillXCommand:536 - USER[isotopeuser] GROUP[-] 
> TOKEN[] APP[MonitoringTestWorkflow1__0] 
> JOB[0000712-121031144215857-oozie-isot-W] ACTION[-] Exception, 
> java.lang.IllegalMonitorStateException
>       at 
> java.util.concurrent.locks.ReentrantReadWriteLock$Sync.tryRelease(ReentrantReadWriteLock.java:374)
>       at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.release(AbstractQueuedSynchronizer.java:1260)
>       at 
> java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.unlock(ReentrantReadWriteLock.java:1131)
>       at 
> org.apache.oozie.util.MemoryLocks$LockToken.release(MemoryLocks.java:59)
>       at org.apache.oozie.command.XCommand.releaseLock(XCommand.java:222)
>       at org.apache.oozie.command.XCommand.call(XCommand.java:295)
>       at 
> org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:326)
>       at 
> org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:255)
>       at 
> org.apache.oozie.service.CallableQueueService$CallableWrapper.run(CallableQueueService.java:175)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>       at java.lang.Thread.run(Thread.java:722)
> 2012-11-05 21:01:28,475  WARN CallableQueueService$CompositeCallable:542 - 
> USER[-] GROUP[-] TOKEN[-] APP[-] JOB[-] ACTION[-] exception callable [kill], 
> E0607: Other error in operation [java.lang.IllegalMonitorStateException], {1}
> org.apache.oozie.command.CommandException: E0607: Other error in operation 
> [java.lang.IllegalMonitorStateException], {1}
>       at org.apache.oozie.command.XCommand.call(XCommand.java:317)
>       at 
> org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:326)
>       at 
> org.apache.oozie.service.CallableQueueService$CompositeCallable.call(CallableQueueService.java:255)
>       at 
> org.apache.oozie.service.CallableQueueService$CallableWrapper.run(CallableQueueService.java:175)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>       at java.lang.Thread.run(Thread.java:722)
> Caused by: java.lang.IllegalMonitorStateException
>       at 
> java.util.concurrent.locks.ReentrantReadWriteLock$Sync.tryRelease(ReentrantReadWriteLock.java:374)
>       at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.release(AbstractQueuedSynchronizer.java:1260)
>       at 
> java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.unlock(ReentrantReadWriteLock.java:1131)
>       at 
> org.apache.oozie.util.MemoryLocks$LockToken.release(MemoryLocks.java:59)
>       at org.apache.oozie.command.XCommand.releaseLock(XCommand.java:222)
>       at org.apache.oozie.command.XCommand.call(XCommand.java:295)
>       ... 6 more
> {code}
> What I did, is some time ago I killed a coordinator app whic had a lot of 
> workflows in WAITING state, since then I see this error in the log file.
> The log/exception is shown periodically, and it happen for the same job id 
> (0000712-121031144215857-oozie-isot-W) and also several other job ids.
> According to Mohammad Islam:
> This is definitely a bug in XCommand class.
> Most possibly, the bug was introduced when interrupt type command was added.
> Would you please create a JIRA for this adding the contents of this email?
> Probable reason:
> Trying to release a lock w/o locking it first. It could happen on special 
> case.
> The bug is here in XCommand code:
> {code}
> finally{
>     if (isLockRequired()) {
>         releaseLock();
>     }
> }
> {code}
> The "if" condition should be changed to something like this (used before 
> locking):
> if (isLockRequired() && !this.inInterruptMode())  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to