Sorry for the long message, but it's hard to explain the issue succinctly.

We recently saw a weird issue where Oozie suddenly stopped materializing
actions for a Coordinator job.  The only thing we saw in the logs was this
message:
2014-01-31 02:10:03,378 WARN org.apache.oozie.service.JPAService: USER[-] GROUP[-] TOKEN[-] APP[-] JOB[-] ACTION[-] JPAExecutor [CoordJobsToBeMaterializedJPAExecutor] ended with an active transaction, rolling back

After a bunch of digging, I figured out what was happening:

The CoordMaterializeTriggerService is responsible for materializing Coord
actions via a Runnable that gets scheduled every 5min via the
SchedulerService.  The SchedulerService is backed by a
ScheduledThreadPoolExecutor.  The Runnable does a bunch of things, but the
important thing is that it executes CoordJobsToBeMaterializedJPAExecutor
via the JPAService.  In the JPAService#execute(...) method, this happens:
try {
    if (em.getTransaction().isActive()) {
        LOG.warn("JPAExecutor [{0}] ended with an active transaction, rolling back", executor.getName());
        *em.getTransaction().rollback();*
    }
}
catch (Exception ex) {
    LOG.warn("Could not check/rollback transaction after JPAExecutor [{0}], {1}", executor.getName(),
        ex.getMessage(), ex);
}
I've determined that the bold line (the rollback) can throw an Error (as in
the Throwable subclass, not a synonym for Exception).  For obvious reasons,
we're not catching this Error anywhere.  This causes the Thread to exit;
however, we don't "lose" any threads because the
ScheduledThreadPoolExecutor will simply start a new one.  The real problem
is that the task the thread was executing is dropped from the schedule.
The documentation [1] says "If any execution of the task encounters an
exception, subsequent executions are suppressed."  I did some testing, and
in our setup that's not quite what I observed: Exceptions didn't cause
subsequent executions to be suppressed, but Errors did.
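
To make the suppression behavior concrete, here is a minimal standalone
sketch using a plain JDK scheduled pool, not Oozie's SchedulerService.  The
class GuardedTaskDemo, the guarded(...) wrapper and the simulated Error are
all made up for illustration.  It counts how often a task that always throws
an Error gets started, first scheduled as-is and then wrapped in a
catch-all.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class GuardedTaskDemo {

    // Hypothetical wrapper: nothing the task throws (Error included)
    // is allowed to reach the executor.
    static Runnable guarded(Runnable task) {
        return () -> {
            try {
                task.run();
            } catch (Throwable t) {
                System.err.println("Scheduled task failed, keeping it scheduled: " + t);
            }
        };
    }

    // Schedules an always-failing task with a 10 ms fixed delay, waits,
    // and returns how many times the task was started.
    static int countRuns(boolean guard, long waitMillis) throws InterruptedException {
        AtomicInteger runs = new AtomicInteger();
        Runnable failing = () -> {
            runs.incrementAndGet();
            // Simulated stand-in for the Error seen during rollback.
            throw new ExceptionInInitializerError("simulated database failure");
        };
        ScheduledExecutorService pool = Executors.newScheduledThreadPool(1);
        try {
            pool.scheduleWithFixedDelay(guard ? guarded(failing) : failing,
                    0, 10, TimeUnit.MILLISECONDS);
            Thread.sleep(waitMillis);
        } finally {
            pool.shutdownNow();
        }
        return runs.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("unguarded runs: " + countRuns(false, 300));
        System.out.println("guarded runs:   " + countRuns(true, 300));
    }
}
```

In this standalone setup the unguarded task is started once and then
silently dropped, while the guarded one keeps running.  Whether the same
wrapper is enough inside Oozie depends on where exactly the Error escapes,
which this sketch does not model.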

I tried catching all Errors to figure out what was causing this, reproduced
the problem (or at least a similar problem) by killing the MySQL database,
and got a stack trace like this (which I've shortened for brevity):
java.lang.ExceptionInInitializerError
        at org.apache.openjpa.lib.util.ConcreteClassGenerator.newInstance(ConcreteClassGenerator.java:132)
        at org.apache.openjpa.lib.jdbc.ConfiguringConnectionDecorator.decorate(ConfiguringConnectionDecorator.java:110)
        at org.apache.openjpa.lib.jdbc.DecoratingDataSource.decorate(DecoratingDataSource.java:118)
        at org.apache.openjpa.lib.jdbc.DecoratingDataSource.getConnection(DecoratingDataSource.java:107)
        ...
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.GeneratedConstructorAccessor26.newInstance(Unknown Source)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
        at org.apache.openjpa.lib.util.ConcreteClassGenerator.newInstance(ConcreteClassGenerator.java:130)
        ... 28 more
Caused by: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure
As we can see here [2], ConcreteClassGenerator.newInstance throws an
ExceptionInInitializerError when java.lang.reflect.Constructor.newInstance
throws any Exception; in this case, an InvocationTargetException, which it
throws if the underlying constructor in the reflection throws an Exception
[3].  And the underlying constructor is throwing a CommunicationsException
from the MySQL JDBC driver.
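
The wrapping chain described above can be reproduced in isolation.  This is
only a sketch of the pattern, not the OpenJPA code: WrappedErrorDemo,
FailingConnection and the local newInstance helper are hypothetical
stand-ins, and a plain Exception replaces the MySQL CommunicationsException.

```java
import java.lang.reflect.Constructor;

public class WrappedErrorDemo {

    // Hypothetical stand-in for a class whose constructor fails
    // because the database is unreachable.
    public static class FailingConnection {
        public FailingConnection() throws Exception {
            throw new Exception("Communications link failure (simulated)");
        }
    }

    // Mimics the described pattern: any Exception coming out of the
    // reflective call is rethrown as an ExceptionInInitializerError.
    static <T> T newInstance(Constructor<T> cons) {
        try {
            return cons.newInstance();
        } catch (Exception e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Shows which catch clause sees the failure and what the cause chain is.
    static String describe() {
        try {
            newInstance(FailingConnection.class.getConstructor());
            return "no failure";
        } catch (Exception e) {
            return "caught as Exception";
        } catch (Error e) {
            Throwable reflection = e.getCause();   // InvocationTargetException
            Throwable root = reflection.getCause(); // the constructor's Exception
            return e.getClass().getSimpleName()
                    + " <- " + reflection.getClass().getSimpleName()
                    + " <- " + root.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

The point of the sketch is that the catch (Exception ...) clause never sees
the failure, which matches why the catch block in JPAService#execute(...)
does not help here.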

Long story short, in some cases scheduled tasks can throw an Error instead
of an Exception when there's a database-related problem.  And this causes
them to not be rescheduled.  It is also possible for this to happen to
other tasks being executed via the SchedulerService besides the
CoordMaterializeTriggerService, which would result in other missing
behaviors.

I also noticed that even when I caught the Error, it still didn't
reschedule the task -- not sure why...

Any thoughts or suggestions on the best way to recover from this situation?

Restarting Oozie works, but that's not ideal.
One idea I had was to add some kind of "watchdog" thread that would somehow
notice if a scheduled task hadn't executed in the proper interval + some
threshold and reschedule it.
My other, simpler, idea is to try to manually reschedule a new identical
task in the Error catch block, which I'll try soon.
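
A rough sketch of the manual-reschedule idea, again against a plain JDK
scheduled pool rather than Oozie's SchedulerService.  It is a variation on
the idea rather than exactly what is described above: instead of a periodic
schedule plus a catch block, each run is a one-shot task that schedules its
own next run in a finally block.  SelfReschedulingTask and countRuns are
hypothetical names.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class SelfReschedulingTask implements Runnable {
    private final ScheduledExecutorService pool;
    private final Runnable task;
    private final long delayMillis;

    SelfReschedulingTask(ScheduledExecutorService pool, Runnable task, long delayMillis) {
        this.pool = pool;
        this.task = task;
        this.delayMillis = delayMillis;
    }

    @Override
    public void run() {
        try {
            task.run();
        } finally {
            // Reached whether the task returned, threw an Exception,
            // or threw an Error: queue the next one-shot run.
            try {
                if (!pool.isShutdown()) {
                    pool.schedule(this, delayMillis, TimeUnit.MILLISECONDS);
                }
            } catch (RejectedExecutionException ignored) {
                // The pool was shut down in the meantime; nothing to reschedule.
            }
        }
    }

    // Runs an always-failing task for a while and returns how often it started.
    static int countRuns(long waitMillis) throws InterruptedException {
        AtomicInteger runs = new AtomicInteger();
        ScheduledExecutorService pool = Executors.newScheduledThreadPool(1);
        Runnable failing = () -> {
            runs.incrementAndGet();
            throw new ExceptionInInitializerError("simulated database failure");
        };
        try {
            pool.schedule(new SelfReschedulingTask(pool, failing, 10), 0, TimeUnit.MILLISECONDS);
            Thread.sleep(waitMillis);
        } finally {
            pool.shutdownNow();
        }
        return runs.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("runs despite Errors: " + countRuns(300));
    }
}
```

The trade-off is that the Error is still swallowed silently by the
executor's future unless the task logs it, so some logging in the wrapper
would probably be wanted in practice.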



[1] http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ScheduledThreadPoolExecutor.html#scheduleWithFixedDelay(java.lang.Runnable,%20long,%20long,%20java.util.concurrent.TimeUnit)
[2] http://grepcode.com/file/repo1.maven.org/maven2/org.apache.openjpa/openjpa-lib/2.2.2/org/apache/openjpa/lib/util/ConcreteClassGenerator.java?av=f#145
[3] http://docs.oracle.com/javase/7/docs/api/java/lang/reflect/Constructor.html#newInstance(java.lang.Object...)
