[I'm off at the moment; just saw this and wanted to quickly point out something
about Java Errors]

http://stackoverflow.com/questions/11017304/catching-java-errors

see accepted response

thx

Alejandro
(phone typing)

> On Feb 24, 2014, at 10:43, Virag Kothari <[email protected]> wrote:
> 
> The coord materialization runnable always does the same task, which is
> bringing coord jobs eligible for materialization onto the Oozie server.
> So it shouldn't matter if one of the scheduled tasks is lost due to an
> error or exception. If only one coord job stops materializing and not the
> others, I am also thinking that you might be hitting OOZIE-1699. Do you
> see that a single job's materialization works fine when there is an
> Exception but never proceeds when the runnable throws an Error?
> 
> Thanks,
> Virag
> 
>> On 2/24/14 10:16 AM, "Robert Kanter" <[email protected]> wrote:
>> 
>> Shwetha, while that sounds similar to the issue I described, and may have
>> a similar solution, I don't think they're the same issue.  Your issue
>> looks like it has to do with some logic Oozie is using in
>> CallableQueueService where it's not handling Exceptions properly in an
>> edge case.  My issue has to do with how Java's ThreadPool handles
>> Runnables that throw an Error.
>> 
>> 
>> 
>>> On Mon, Feb 24, 2014 at 1:39 AM, Shwetha GS <[email protected]> wrote:
>>> 
>>> We had the same issue. We found that, because of an edge case, the
>>> callable command (CoordMaterializeTransitionXCommand in this case, which
>>> does the actual materialisation) is added to the unique callable (in
>>> CallableQueueService) but is never added to the queue. Hence the
>>> materialisation doesn't happen. Since CoordMaterializeTransitionXCommand
>>> is already in the unique callable, further recovery from
>>> CoordMaterializeTriggerService doesn't enqueue any more
>>> CoordMaterializeTransitionXCommand. The issue is being fixed as part of
>>> https://issues.apache.org/jira/browse/OOZIE-1699. This issue can happen
>>> with any command (not just materialisation).
>>> 
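The failure mode described above can be illustrated with a toy queue. This is only a sketch under stated assumptions: UniqueQueueSketch, its methods, and the "queue is full" condition are invented stand-ins, since the thread does not say what the actual edge case in CallableQueueService is.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Hypothetical illustration only; this is not Oozie's CallableQueueService.
class UniqueQueueSketch {
    private final Set<String> uniqueKeys = new HashSet<>();
    private final Queue<String> queue = new ArrayDeque<>();
    private final int capacity;
    private final boolean cleanUpOnFailure;

    UniqueQueueSketch(int capacity, boolean cleanUpOnFailure) {
        this.capacity = capacity;
        this.cleanUpOnFailure = cleanUpOnFailure;
    }

    /** Returns true if the command was accepted into the queue. */
    synchronized boolean enqueue(String key) {
        if (!uniqueKeys.add(key)) {
            return false; // key already registered: treated as a duplicate
        }
        if (queue.size() >= capacity) { // invented stand-in for the edge case
            if (cleanUpOnFailure) {
                uniqueKeys.remove(key); // without this, the key is stuck forever
            }
            return false;
        }
        queue.add(key);
        return true;
    }

    /** Takes the next command and releases its uniqueness key. */
    synchronized String poll() {
        String key = queue.poll();
        if (key != null) {
            uniqueKeys.remove(key);
        }
        return key;
    }

    /** Demo: can "coord-mater" be enqueued again once a slot frees up? */
    static boolean retrySucceeds(boolean cleanUpOnFailure) {
        UniqueQueueSketch q = new UniqueQueueSketch(1, cleanUpOnFailure);
        q.enqueue("other-command"); // fills the single slot
        q.enqueue("coord-mater");   // rejected because the queue is full
        q.poll();                   // frees the slot
        return q.enqueue("coord-mater");
    }
}
```

Without the cleanup, the key stays registered although nothing was queued, so every later attempt looks like a duplicate. That matches the symptom of the coord showing up in the unique callable while the command never runs.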
>>> To verify if this is indeed the issue,
>>> 1. CoordMaterializeTriggerService should pick up this coord every 5 mins
>>>    (the materialisation interval)
>>> 2. CoordMaterializeTransitionXCommand never runs
>>> 3. oozie admin -queuedump lists that coord for mater in unique callable
>>> 
>>> 
>>> 
>>> 
>>> On Sat, Feb 22, 2014 at 5:29 AM, Robert Kanter <[email protected]> wrote:
>>> 
>>>> Sorry for the long message, but it's hard to explain the issue succinctly.
>>>> 
>>>> We recently saw a weird issue where Oozie suddenly stopped materializing
>>>> actions for a Coordinator job.  All we saw in the logs was this message:
>>>> 2014-01-31 02:10:03,378 WARN org.apache.oozie.service.JPAService: USER[-] GROUP[-] TOKEN[-] APP[-] JOB[-] ACTION[-] JPAExecutor [CoordJobsToBeMaterializedJPAExecutor] ended with an active transaction, rolling back
>>>> 
>>>> After a bunch of digging, I figured out what was happening:
>>>> 
>>>> The CoordMaterializeTriggerService is responsible for materializing Coord
>>>> actions via a Runnable that gets scheduled every 5min via the
>>>> SchedulerService.  The SchedulerService is backed by a
>>>> ScheduledThreadPoolExecutor.  The Runnable does a bunch of things, but
>>>> the important thing is that it executes
>>>> CoordJobsToBeMaterializedJPAExecutor via the JPAService.  In the
>>>> JPAService#execute(...) method, this happens:
>>>> try {
>>>>     if (em.getTransaction().isActive()) {
>>>>         LOG.warn("JPAExecutor [{0}] ended with an active transaction, rolling back",
>>>>                 executor.getName());
>>>>         *em.getTransaction().rollback();*
>>>>     }
>>>> }
>>>> catch (Exception ex) {
>>>>     LOG.warn("Could not check/rollback transaction after JPAExecutor [{0}], {1}",
>>>>             executor.getName(), ex.getMessage(), ex);
>>>> }
>>>> I've determined that the bold line (the rollback) can throw an Error (as
>>>> in the Throwable, not a synonym for Exception).  For obvious reasons,
>>>> we're not catching this Error anywhere.  This causes the Thread to exit;
>>>> however, we don't "lose" any threads because the
>>>> ScheduledThreadPoolExecutor will simply start a new one.  However, the
>>>> task that the thread was executing is dropped from the schedule.
>>>> According to the documentation [1], "If any execution of the task
>>>> encounters an exception, subsequent executions are suppressed."  I did
>>>> some testing, and that's not quite true.  Exceptions don't cause
>>>> subsequent executions to be suppressed, but Errors do.
>>>> 
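A note on the observation above: on a bare java.util.concurrent.ScheduledThreadPoolExecutor, any Throwable that escapes run(), including a RuntimeException, suppresses later executions. If Exceptions look harmless inside Oozie, one possible explanation (an assumption, not verified against the Oozie source) is that the scheduled Runnable is wrapped in a catch (Exception) block, which an Error bypasses. A minimal sketch, with SuppressionDemo and countRuns as made-up names:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class SuppressionDemo {

    /**
     * Schedules a task that fails on every run and reports how many times
     * it was executed within roughly 300 ms.
     */
    static int countRuns(boolean throwError, boolean guard) {
        ScheduledExecutorService pool = Executors.newScheduledThreadPool(1);
        AtomicInteger runs = new AtomicInteger();

        Runnable failing = () -> {
            runs.incrementAndGet();
            if (throwError) {
                throw new Error("simulated Error");
            }
            throw new RuntimeException("simulated Exception");
        };
        // Guarded variant: nothing escapes run(), so the schedule stays alive.
        Runnable guarded = () -> {
            try {
                failing.run();
            } catch (Throwable t) {
                // A real service would log here.
            }
        };

        pool.scheduleWithFixedDelay(guard ? guarded : failing, 0, 10, TimeUnit.MILLISECONDS);
        try {
            Thread.sleep(300);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdownNow();
        }
        return runs.get();
    }

    public static void main(String[] args) {
        System.out.println("RuntimeException, unguarded: " + countRuns(false, false));
        System.out.println("Error, unguarded:            " + countRuns(true, false));
        System.out.println("Error, guarded:              " + countRuns(true, true));
    }
}
```

The unguarded task runs exactly once for both the RuntimeException and the Error, while the guarded task keeps running. Catching Throwable also swallows conditions such as OutOfMemoryError, so whether that is acceptable is a judgment call.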
>>>> I tried catching all Errors to figure out what was causing this and
>>>> reproduced the problem (or at least a similar problem, by killing the
>>>> MySQL database) and got a stack trace like this (which I've shortened
>>>> for brevity):
>>>> java.lang.ExceptionInInitializerError
>>>>        at org.apache.openjpa.lib.util.ConcreteClassGenerator.newInstance(ConcreteClassGenerator.java:132)
>>>>        at org.apache.openjpa.lib.jdbc.ConfiguringConnectionDecorator.decorate(ConfiguringConnectionDecorator.java:110)
>>>>        at org.apache.openjpa.lib.jdbc.DecoratingDataSource.decorate(DecoratingDataSource.java:118)
>>>>        at org.apache.openjpa.lib.jdbc.DecoratingDataSource.getConnection(DecoratingDataSource.java:107)
>>>>        ...
>>>> Caused by: java.lang.reflect.InvocationTargetException
>>>>        at sun.reflect.GeneratedConstructorAccessor26.newInstance(Unknown Source)
>>>>        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>>>>        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>>>>        at org.apache.openjpa.lib.util.ConcreteClassGenerator.newInstance(ConcreteClassGenerator.java:130)
>>>>        ... 28 more
>>>> Caused by: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure
>>>> As we can see here [2], ConcreteClassGenerator.newInstance throws an
>>>> ExceptionInInitializerError when
>>>> java.lang.reflect.Constructor.newInstance throws any Exception; in this
>>>> case, an InvocationTargetException, which it throws if the underlying
>>>> constructor in the reflection throws an Exception [3].  And the
>>>> underlying constructor is throwing a CommunicationsException from the
>>>> MySQL JDBC driver.
>>>> 
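The wrapping chain described above can be reproduced in isolation. This is a sketch under stated assumptions: ErrorWrappingDemo and NeedsDatabase are invented names, the newInstance helper only mimics the pattern described for ConcreteClassGenerator rather than copying it, and an IllegalStateException stands in for the MySQL CommunicationsException.

```java
import java.lang.reflect.Constructor;

class ErrorWrappingDemo {

    /** Hypothetical stand-in for a class whose constructor needs the database. */
    static class NeedsDatabase {
        public NeedsDatabase() {
            throw new IllegalStateException("Communications link failure (simulated)");
        }
    }

    /** Mimics the described pattern: any reflective failure is rethrown as an Error. */
    static Object newInstance(Class<?> type) {
        try {
            Constructor<?> ctor = type.getDeclaredConstructor();
            // Throws InvocationTargetException if the constructor itself throws.
            return ctor.newInstance();
        } catch (Exception e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    /** Walks the cause chain down to the original failure. */
    static Throwable rootCause(Throwable t) {
        while (t.getCause() != null) {
            t = t.getCause();
        }
        return t;
    }

    /** Reports which catch clause ends up handling the failure. */
    static String whichCatch() {
        try {
            try {
                newInstance(NeedsDatabase.class);
                return "no failure";
            } catch (Exception e) {
                return "caught as Exception";
            }
        } catch (Error e) {
            return "escaped catch (Exception): " + rootCause(e).getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(whichCatch());
    }
}
```

Because ExceptionInInitializerError extends Error, the inner catch (Exception) never sees it, even though the root cause is an ordinary exception from the constructor.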
>>>> Long story short, in some cases scheduled tasks can throw an Error
>>>> instead of an Exception when there's a database-related problem, and
>>>> this causes them to not be rescheduled.  It is also possible for this to
>>>> happen to other tasks executed via the SchedulerService besides the
>>>> CoordMaterializeTriggerService, which would result in other missing
>>>> behaviors.
>>>> 
>>>> I also noticed that even when I caught the Error, it still didn't
>>>> reschedule the task -- not sure why...
>>>> 
>>>> Any thoughts or suggestions on the best way to recover from this situation?
>>>> 
>>>> Restarting Oozie works, but that's not ideal.
>>>> One idea I had was to add some kind of "watchdog" thread that would
>>>> somehow notice if a scheduled task hadn't executed within the proper
>>>> interval + some threshold and reschedule it.
>>>> My other, simpler, idea is to try to manually reschedule a new identical
>>>> task in the Error catch block, which I'll try soon.
>>>> 
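One possible shape for the watchdog idea, sketched with plain java.util.concurrent types. Everything here (WatchdogSketch, its fields, the intervals) is hypothetical and not taken from Oozie. It relies on the fact that the ScheduledFuture of a periodic task only reports isDone() once the task has been cancelled or an execution has thrown.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; names and intervals are assumptions, not Oozie code.
class WatchdogSketch {
    private final ScheduledExecutorService pool = Executors.newScheduledThreadPool(2);
    private final Runnable task;
    private final long delayMs;
    private volatile ScheduledFuture<?> handle;
    final AtomicInteger restarts = new AtomicInteger();

    WatchdogSketch(Runnable task, long delayMs) {
        this.task = task;
        this.delayMs = delayMs;
    }

    void start() {
        handle = pool.scheduleWithFixedDelay(task, 0, delayMs, TimeUnit.MILLISECONDS);
        // The watchdog lets nothing escape, so it cannot be suppressed itself.
        pool.scheduleWithFixedDelay(() -> {
            try {
                // Done but not cancelled means an execution threw.
                if (handle.isDone() && !handle.isCancelled()) {
                    restarts.incrementAndGet();
                    handle = pool.scheduleWithFixedDelay(task, delayMs, delayMs, TimeUnit.MILLISECONDS);
                }
            } catch (Throwable t) {
                // A real service would log here and keep watching.
            }
        }, delayMs, delayMs, TimeUnit.MILLISECONDS);
    }

    void stop() {
        pool.shutdownNow();
    }

    /** Demo: the task throws an Error on its first run only; returns {runs, restarts}. */
    static int[] demo() {
        AtomicInteger runs = new AtomicInteger();
        WatchdogSketch w = new WatchdogSketch(() -> {
            if (runs.incrementAndGet() == 1) {
                throw new Error("simulated database failure");
            }
        }, 20);
        w.start();
        try {
            Thread.sleep(400);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        w.stop();
        return new int[] { runs.get(), w.restarts.get() };
    }
}
```

Compared with rescheduling inside the catch block, this keeps the recovery logic out of the task itself, at the cost of a second periodic task and a delay of up to one watchdog interval before the restart.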
>>>> 
>>>> 
>>>> [1] http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ScheduledThreadPoolExecutor.html#scheduleWithFixedDelay(java.lang.Runnable,%20long,%20long,%20java.util.concurrent.TimeUnit)
>>>> [2] http://grepcode.com/file/repo1.maven.org/maven2/org.apache.openjpa/openjpa-lib/2.2.2/org/apache/openjpa/lib/util/ConcreteClassGenerator.java?av=f#145
>>>> [3] http://docs.oracle.com/javase/7/docs/api/java/lang/reflect/Constructor.html#newInstance(java.lang.Object...)
>>> 
> 
