We had the same issue. We found that, because of an edge case, the callable command (CoordMaterializeTransitionXCommand in this case, which does the actual materialisation) is added to the unique callables (in CallableQueueService) but is never added to the queue, so the materialisation doesn't happen. Since CoordMaterializeTransitionXCommand is already registered as a unique callable, further recovery attempts from CoordMaterializeTriggerService don't enqueue any more CoordMaterializeTransitionXCommand instances. The issue is being fixed as part of https://issues.apache.org/jira/browse/OOZIE-1699. It can happen with any command, not just materialisation.
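To make the failure mode concrete, here is a toy model of a queue with a uniqueness set. It is not Oozie's CallableQueueService code: the class, the method names and the "queue is full" rejection are all hypothetical stand-ins, since the actual edge case is described in the JIRA rather than here. It only shows how a key that is registered as unique but never enqueued blocks every later attempt to enqueue the same key.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Toy model only: NOT Oozie's CallableQueueService. All names are hypothetical.
class UniqueQueueSketch {
    private final Set<String> uniqueKeys = new HashSet<>();
    private final Queue<String> queue = new ArrayDeque<>();
    private final int capacity;

    UniqueQueueSketch(int capacity) {
        this.capacity = capacity;
    }

    // Leaky variant: registers the key first and leaves it registered
    // even when the enqueue itself is rejected.
    synchronized boolean offerLeaky(String key) {
        if (!uniqueKeys.add(key)) {
            return false; // looks like a duplicate, so it is skipped
        }
        if (queue.size() >= capacity) {
            return false; // rejected (assumed edge case), but the key stays registered
        }
        queue.add(key);
        return true;
    }

    // Safer variant: un-registers the key when the enqueue is rejected.
    synchronized boolean offerSafe(String key) {
        if (!uniqueKeys.add(key)) {
            return false;
        }
        if (queue.size() >= capacity) {
            uniqueKeys.remove(key);
            return false;
        }
        queue.add(key);
        return true;
    }

    // Takes the next key off the queue and releases its uniqueness entry.
    synchronized String poll() {
        String key = queue.poll();
        if (key != null) {
            uniqueKeys.remove(key);
        }
        return key;
    }

    synchronized boolean isRegistered(String key) {
        return uniqueKeys.contains(key);
    }

    synchronized boolean isQueued(String key) {
        return queue.contains(key);
    }

    public static void main(String[] args) {
        UniqueQueueSketch q = new UniqueQueueSketch(1);
        q.offerLeaky("other-command"); // fills the single slot
        q.offerLeaky("coord-mater");   // rejected, but its key leaks into the set
        q.poll();                      // frees the slot again
        // Every retry is now treated as a duplicate although nothing is queued.
        System.out.println("retry accepted: " + q.offerLeaky("coord-mater"));
        System.out.println("registered: " + q.isRegistered("coord-mater")
                + ", queued: " + q.isQueued("coord-mater"));
    }
}
```

In this model the stuck state is exactly "present in the unique set, absent from the queue", which is the same shape as the symptom we check for with the queue dump.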
To verify whether this is indeed the issue, check that:
1. CoordMaterializeTriggerService should pick up this coord every 5 mins (the materialisation interval)
2. CoordMaterializeTransitionXCommand never runs
3. oozie admin -queuedump lists that coord for mater in the unique callables

On Sat, Feb 22, 2014 at 5:29 AM, Robert Kanter <[email protected]> wrote:

> Sorry for the long message, but it's hard to explain the issue succinctly.
>
> We recently saw a weird issue where Oozie suddenly stopped materializing
> actions for a Coordinator job. All we saw in the logs was this message:
> 2014-01-31 02:10:03,378 WARN org.apache.oozie.service.JPAService: USER[-] GROUP[-] TOKEN[-] APP[-] JOB[-] ACTION[-] JPAExecutor [CoordJobsToBeMaterializedJPAExecutor] ended with an active transaction, rolling back
>
> After a bunch of digging, I figured out what was happening:
>
> The CoordMaterializeTriggerService is responsible for materializing Coord
> actions via a Runnable that gets scheduled every 5min via the
> SchedulerService. The SchedulerService is backed by a
> ScheduledThreadPoolExecutor. The Runnable does a bunch of things, but the
> important thing is that it executes CoordJobsToBeMaterializedJPAExecutor
> via the JPAService. In the JPAService#execute(...) method, this happens:
>     try {
>         if (em.getTransaction().isActive()) {
>             LOG.warn("JPAExecutor [{0}] ended with an active transaction, rolling back", executor.getName());
>             *em.getTransaction().rollback();*
>         }
>     }
>     catch (Exception ex) {
>         LOG.warn("Could not check/rollback transaction after JPAExecutor [{0}], {1}", executor.getName(), ex.getMessage(), ex);
>     }
> I've determined that the bold line (the rollback) can throw an Error (as in
> the Throwable, not a synonym for Exception). For obvious reasons, we're
> not catching this Error anywhere. This causes the Thread to exit; however,
> we don't "lose" any threads because the ScheduledThreadPoolExecutor will
> simply start a new one.
> However, the task that the thread was executing is
> dropped from the schedule. According to the documentation [1] it says "If
> any execution of the task encounters an exception, subsequent executions
> are suppressed." I did some testing, and that's not quite true.
> Exceptions don't cause subsequent executions to be suppressed, but Errors
> do.
>
> I tried catching all Errors to figure out what was causing this and
> reproduced the problem (or at least a similar problem by killing the MySQL
> database) and got a stack trace like this (that I've shortened for brevity):
> java.lang.ExceptionInInitializerError
>     at org.apache.openjpa.lib.util.ConcreteClassGenerator.newInstance(ConcreteClassGenerator.java:132)
>     at org.apache.openjpa.lib.jdbc.ConfiguringConnectionDecorator.decorate(ConfiguringConnectionDecorator.java:110)
>     at org.apache.openjpa.lib.jdbc.DecoratingDataSource.decorate(DecoratingDataSource.java:118)
>     at org.apache.openjpa.lib.jdbc.DecoratingDataSource.getConnection(DecoratingDataSource.java:107)
>     ...
> Caused by: java.lang.reflect.InvocationTargetException
>     at sun.reflect.GeneratedConstructorAccessor26.newInstance(Unknown Source)
>     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>     at org.apache.openjpa.lib.util.ConcreteClassGenerator.newInstance(ConcreteClassGenerator.java:130)
>     ... 28 more
> Caused by: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure
> As we can see here [2], ConcreteClassGenerator.newInstance throws an
> ExceptionInInitializerError when java.lang.reflect.Constructor.newInstance
> throws any Exception; in this case, an InvocationTargetException, which it
> throws if the underlying constructor in the reflection throws an Exception
> [3]. And the underlying constructor is throwing a CommunicationsException
> from the MySQL JDBC driver.
>
> Long story short, in some cases scheduled tasks can throw an Error instead
> of an Exception when there's a database-related problem. And this causes
> them to not be rescheduled. It is also possible for this to happen to
> other tasks being executed via the SchedulerService besides the
> CoordMaterializeTriggerService, which would result in other missing
> behaviors.
>
> I also noticed that even when I caught the Error, it still didn't
> reschedule the task -- not sure why...
>
> Any thoughts or suggestions on the best way to recover from this situation?
>
> Restarting Oozie works, but that's not ideal.
> One idea I had was to add some kind of "watchdog" thread that would somehow
> notice if a scheduled task hadn't executed in the proper interval + some
> threshold and reschedule it.
> My other, simpler, idea is to try to manually reschedule a new identical
> task in the Error catch block, which I'll try soon.
>
>
>
> [1] http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ScheduledThreadPoolExecutor.html#scheduleWithFixedDelay(java.lang.Runnable,%20long,%20long,%20java.util.concurrent.TimeUnit)
> [2] http://grepcode.com/file/repo1.maven.org/maven2/org.apache.openjpa/openjpa-lib/2.2.2/org/apache/openjpa/lib/util/ConcreteClassGenerator.java?av=f#145
> [3] http://docs.oracle.com/javase/7/docs/api/java/lang/reflect/Constructor.html#newInstance(java.lang.Object...)
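On the scheduling side of the quoted message, a minimal standalone sketch of how a periodic task on a plain JDK scheduled executor stops being rescheduled once a Throwable escapes its run() method, and how a catch-all wrapper inside the Runnable keeps it on the schedule. This is not Oozie's SchedulerService code: the class and method names are hypothetical, the ExceptionInInitializerError is thrown by hand to simulate the database failure, and it does not explain why catching the Error inside Oozie still left the task unscheduled.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Standalone illustration only: NOT Oozie code. All names are hypothetical.
class GuardedSchedulingSketch {

    // Wraps a task so that nothing it throws can escape to the executor.
    static Runnable guarded(Runnable task) {
        return () -> {
            try {
                task.run();
            } catch (Throwable t) { // deliberately broad: Error is not an Exception
                System.err.println("Scheduled task failed, keeping it scheduled: " + t);
            }
        };
    }

    // Schedules a task that throws an Error on its first run only, lets it
    // tick for half a second, and returns how many times it was invoked.
    static int countRuns(boolean guard) {
        AtomicInteger runs = new AtomicInteger();
        Runnable failingOnce = () -> {
            if (runs.incrementAndGet() == 1) {
                throw new ExceptionInInitializerError("simulated database failure");
            }
        };
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);
        try {
            scheduler.scheduleWithFixedDelay(
                    guard ? guarded(failingOnce) : failingOnce,
                    0, 20, TimeUnit.MILLISECONDS);
            Thread.sleep(500);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            scheduler.shutdownNow();
        }
        return runs.get();
    }

    public static void main(String[] args) {
        System.out.println("runs without guard: " + countRuns(false));
        System.out.println("runs with guard: " + countRuns(true));
    }
}
```

Without the guard the task runs once and is silently dropped; with the guard it keeps running on its 20 ms delay. Catching Throwable this broadly also swallows things like OutOfMemoryError, so a real fix would probably want to log loudly and decide per error type whether to continue.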
