It looks like OOZIE-1699 fixes the problem!

Thanks for pointing that out.


On Mon, Feb 24, 2014 at 10:58 AM, Robert Kanter <[email protected]> wrote:

>> Do you see that a single job materialization works fine when there is an
>> Exception but never proceeds when the runnable throws an Error?
>
> Yes, though it did seem odd that it currently works fine for Exceptions
> but not Errors. Is that because of the issue Shwetha described and fixed in
> OOZIE-1699?  I'll check whether that patch fixes the issue we're seeing.
>
>> just seen this and wanted to quickly point out something about java errors
>> http://stackoverflow.com/questions/11017304/catching-java-errors
>> see accepted response
>
> Yup, I saw that.  I don't think we need to worry about all types of
> Errors, just the specific one (ExceptionInInitializerError) that OpenJPA is
> throwing when the MySQL driver is throwing an Exception.  In this case, we
> can recover from this Error by basically just ignoring it for now
> and "hoping" it works the next time the task is run.  See
> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.openjpa/openjpa-lib/2.2.2/org/apache/openjpa/lib/util/ConcreteClassGenerator.java?av=f#149
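The "ignore just this one Error" idea could be sketched roughly as below. This is not Oozie code: the wrapper class IgnoreInitializerError and its counter are hypothetical names made up for illustration, and it assumes the scheduled Runnable can simply be wrapped before it is handed to the scheduler. Only ExceptionInInitializerError is swallowed; every other Error still propagates.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical wrapper: swallows ExceptionInInitializerError so a periodic
// task is not seen as failed by the executor, and lets other Errors through.
class IgnoreInitializerError implements Runnable {
    private final Runnable delegate;
    private final AtomicInteger ignored = new AtomicInteger();

    IgnoreInitializerError(Runnable delegate) {
        this.delegate = delegate;
    }

    @Override
    public void run() {
        try {
            delegate.run();
        } catch (ExceptionInInitializerError e) {
            // Assumption: the failure is transient (e.g. the database was
            // briefly unreachable), so we log it and wait for the next run.
            ignored.incrementAndGet();
            System.err.println("Ignoring " + e + ", cause: " + e.getCause());
        }
    }

    int ignoredCount() {
        return ignored.get();
    }

    public static void main(String[] args) {
        IgnoreInitializerError wrapped = new IgnoreInitializerError(() -> {
            throw new ExceptionInInitializerError(new RuntimeException("link failure"));
        });
        wrapped.run(); // returns normally instead of throwing
        System.out.println("ignored so far: " + wrapped.ignoredCount());
    }
}
```

The wrapped Runnable would then be what gets passed to the scheduler. Deliberately not catching Throwable here keeps things like OutOfMemoryError visible, in line with the advice in the Stack Overflow answer linked above.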
>
>
>
> On Mon, Feb 24, 2014 at 10:49 AM, Alejandro Abdelnur <[email protected]> wrote:
>
>> [me off at the moment, just seen this and wanted to quickly point out
>> something about java errors]
>>
>> http://stackoverflow.com/questions/11017304/catching-java-errors
>>
>> see accepted response
>>
>> thx
>>
>> Alejandro
>> (phone typing)
>>
>> > On Feb 24, 2014, at 10:43, Virag Kothari <[email protected]> wrote:
>> >
>> > The coord materialization runnable always does the same task which is
>> > bringing coord jobs eligible for materialization on Oozie server.
>> > So, it shouldn't matter if one of the scheduled tasks is lost due to an
>> > error or exception. If only one coord job stops materializing and not
>> > others, I am also thinking that you might be hitting OOZIE-1699. Do you
>> > see that a single job materialization works fine when there is an
>> > Exception but never proceeds when the runnable throws an Error?
>> >
>> > Thanks,
>> > Virag
>> >
>> >> On 2/24/14 10:16 AM, "Robert Kanter" <[email protected]> wrote:
>> >>
>> >> Shwetha, while that sounds similar to the issue I described, and may have
>> >> a similar solution, I don't think they're the same issue.  Your issue looks
>> >> like it has to do with some logic Oozie is using in CallableQueueService
>> >> where it's not handling Exceptions properly in an edge case.  My issue has
>> >> to do with how Java's ThreadPool handles Runnables that throw an Error.
>> >>
>> >>
>> >>
>> >>> On Mon, Feb 24, 2014 at 1:39 AM, Shwetha GS <[email protected]> wrote:
>> >>>
>> >>> We had the same issue. We found that because of an edge case, the callable
>> >>> command (CoordMaterializeTransitionXCommand in this case, which does the
>> >>> actual materialisation) is added to the unique callable (in
>> >>> CallableQueueService), but is never added to the queue. Hence the
>> >>> materialisation doesn't happen. Since CoordMaterializeTransitionXCommand is
>> >>> already in unique callable, further recovery from
>> >>> CoordMaterializeTriggerService doesn't enqueue any more
>> >>> CoordMaterializeTransitionXCommand. The issue is being fixed as part of
>> >>> https://issues.apache.org/jira/browse/OOZIE-1699. This issue can happen
>> >>> with any command (not just materialisation).
>> >>>
>> >>> To verify if this is indeed the issue,
>> >>> 1. CoordMaterializeTriggerService should pick up this coord every 5
>> >>> mins (materialisation interval)
>> >>> 2. CoordMaterializeTransitionXCommand never runs
>> >>> 3. oozie admin -queuedump lists that coord for mater in unique callable
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> On Sat, Feb 22, 2014 at 5:29 AM, Robert Kanter <[email protected]> wrote:
>> >>>
>> >>>> Sorry for the long message, but it's hard to explain the issue succinctly.
>> >>>>
>> >>>> We recently saw a weird issue where Oozie suddenly stopped materializing
>> >>>> actions for a Coordinator job.  All we saw in the logs was this message:
>> >>>> 2014-01-31 02:10:03,378 WARN org.apache.oozie.service.JPAService: USER[-] GROUP[-] TOKEN[-] APP[-] JOB[-] ACTION[-] JPAExecutor [CoordJobsToBeMaterializedJPAExecutor] ended with an active transaction, rolling back
>> >>>>
>> >>>> After a bunch of digging, I figured out what was happening:
>> >>>>
>> >>>> The CoordMaterializeTriggerService is responsible for materializing Coord
>> >>>> actions via a Runnable that gets scheduled every 5min via the
>> >>>> SchedulerService.  The SchedulerService is backed by a
>> >>>> ScheduledThreadPoolExecutor.  The Runnable does a bunch of things, but the
>> >>>> important thing is that it executes CoordJobsToBeMaterializedJPAExecutor
>> >>>> via the JPAService.  In the JPAService#execute(...) method, this happens:
>> >>>> try {
>> >>>>    if (em.getTransaction().isActive()) {
>> >>>>        LOG.warn("JPAExecutor [{0}] ended with an active transaction, rolling back", executor.getName());
>> >>>>        *em.getTransaction().rollback();*
>> >>>>    }
>> >>>> }
>> >>>> catch (Exception ex) {
>> >>>>    LOG.warn("Could not check/rollback transaction after JPAExecutor [{0}], {1}", executor.getName(), ex
>> >>>>            .getMessage(), ex);
>> >>>> }
>> >>>> I've determined that the bold line (the rollback) can throw an Error (as in
>> >>>> the Throwable, not a synonym for Exception).  For obvious reasons, we're
>> >>>> not catching this Error anywhere.  This causes the Thread to exit; however,
>> >>>> we don't "lose" any threads because the ScheduledThreadPoolExecutor will
>> >>>> simply start a new one.  However, the task that the thread was executing is
>> >>>> dropped from the schedule.  The documentation [1] says "If
>> >>>> any execution of the task encounters an exception, subsequent executions
>> >>>> are suppressed."  I did some testing, and that's not quite true.
>> >>>> Exceptions don't cause subsequent executions to be suppressed, but Errors
>> >>>> do.
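For anyone wanting to repeat that test outside Oozie, here is a minimal standalone sketch (the class SuppressionDemo and its Failure enum are made-up names, not anything from the Oozie code base). It schedules a task with scheduleWithFixedDelay that fails on every run and counts how many times it actually ran. On a stock JDK one would expect a task that lets a RuntimeException escape to be suppressed just like one that lets an Error escape, since the executor treats any Throwable leaving run() as a failed execution. If Exceptions look harmless inside Oozie, a plausible explanation (an assumption, not verified here) is that something catches them before they reach the executor.

```java
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class SuppressionDemo {
    enum Failure { RUNTIME_EXCEPTION, ERROR, CAUGHT_ERROR }

    // Schedules a task that fails on every run and returns how many times
    // it ran during a short observation window.
    static int countRuns(Failure failure) {
        ScheduledThreadPoolExecutor pool = new ScheduledThreadPoolExecutor(1);
        AtomicInteger runs = new AtomicInteger();
        pool.scheduleWithFixedDelay(() -> {
            runs.incrementAndGet();
            switch (failure) {
                case RUNTIME_EXCEPTION:
                    throw new IllegalStateException("simulated failure");
                case ERROR:
                    throw new ExceptionInInitializerError("simulated failure");
                default:
                    try {
                        throw new ExceptionInInitializerError("simulated failure");
                    } catch (Throwable t) {
                        // Swallowed inside the task, so the executor never sees it.
                    }
            }
        }, 0, 20, TimeUnit.MILLISECONDS);
        try {
            Thread.sleep(500);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdownNow();
        }
        return runs.get();
    }

    public static void main(String[] args) {
        for (Failure f : Failure.values()) {
            System.out.println(f + " -> ran " + countRuns(f) + " time(s)");
        }
    }
}
```

Comparing the three counts shows whether the schedule survives each kind of failure; only the case that catches the Throwable inside the task should keep running.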
>> >>>>
>> >>>> I tried catching all Errors to figure out what was causing this and
>> >>>> reproduced the problem (or at least a similar problem by killing the MySQL
>> >>>> database) and got a stack trace like this (that I've shortened for brevity):
>> >>>> java.lang.ExceptionInInitializerError
>> >>>>        at org.apache.openjpa.lib.util.ConcreteClassGenerator.newInstance(ConcreteClassGenerator.java:132)
>> >>>>        at org.apache.openjpa.lib.jdbc.ConfiguringConnectionDecorator.decorate(ConfiguringConnectionDecorator.java:110)
>> >>>>        at org.apache.openjpa.lib.jdbc.DecoratingDataSource.decorate(DecoratingDataSource.java:118)
>> >>>>        at org.apache.openjpa.lib.jdbc.DecoratingDataSource.getConnection(DecoratingDataSource.java:107)
>> >>>>        ...
>> >>>> Caused by: java.lang.reflect.InvocationTargetException
>> >>>>        at sun.reflect.GeneratedConstructorAccessor26.newInstance(Unknown Source)
>> >>>>        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>> >>>>        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>> >>>>        at org.apache.openjpa.lib.util.ConcreteClassGenerator.newInstance(ConcreteClassGenerator.java:130)
>> >>>>        ... 28 more
>> >>>> Caused by: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure
>> >>>> As we can see here [2], ConcreteClassGenerator.newInstance throws an
>> >>>> ExceptionInInitializerError when java.lang.reflect.Constructor.newInstance
>> >>>> throws any Exception; in this case, an InvocationTargetException, which it
>> >>>> throws if the underlying constructor in the reflection throws an Exception
>> >>>> [3].  And the underlying constructor is throwing a CommunicationsException
>> >>>> from the MySQL JDBC driver.
>> >>>>
>> >>>> Long story short, in some cases scheduled tasks can throw an Error instead
>> >>>> of an Exception when there's a database-related problem.  And this causes
>> >>>> them to not be rescheduled.  It is also possible for this to happen to
>> >>>> other tasks being executed via the SchedulerService besides the
>> >>>> CoordMaterializeTriggerService, which would result in other missing
>> >>>> behaviors.
>> >>>>
>> >>>> I also noticed that even when I caught the Error, it still didn't
>> >>>> reschedule the task -- not sure why...
>> >>>>
>> >>>> Any thoughts or suggestions on the best way to recover from this situation?
>> >>>>
>> >>>> Restarting Oozie works, but that's not ideal.
>> >>>> One idea I had was to add some kind of "watchdog" thread that would somehow
>> >>>> notice if a scheduled task hadn't executed in the proper interval + some
>> >>>> threshold and reschedule it.
>> >>>> My other, simpler, idea is to try to manually reschedule a new identical
>> >>>> task in the Error catch block, which I'll try soon.
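One possible shape for the "manually reschedule" idea, as a hedged sketch rather than a tested fix for Oozie: instead of a periodic task, each run is scheduled as a one-shot and queues the next run itself from a finally block, so a Throwable in one run cannot drop the task from the schedule. The class SelfReschedulingTask and the helper runsDespiteErrors are hypothetical names for illustration, and the sketch assumes a fixed delay between runs is acceptable.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical one-shot task that always queues its own next run.
class SelfReschedulingTask implements Runnable {
    private final ScheduledExecutorService pool;
    private final Runnable task;
    private final long delayMillis;

    SelfReschedulingTask(ScheduledExecutorService pool, Runnable task, long delayMillis) {
        this.pool = pool;
        this.task = task;
        this.delayMillis = delayMillis;
    }

    @Override
    public void run() {
        try {
            task.run();
        } catch (Throwable t) {
            // Catching Throwable is a broad choice; a real service would log
            // this properly and might rethrow fatal Errors after rescheduling.
            System.err.println("Task failed, rescheduling anyway: " + t);
        } finally {
            try {
                pool.schedule(this, delayMillis, TimeUnit.MILLISECONDS);
            } catch (RejectedExecutionException e) {
                // The pool is shutting down, so stop rescheduling.
            }
        }
    }

    // Demo helper: counts runs of a task that throws an Error every time.
    static int runsDespiteErrors() {
        ScheduledExecutorService pool = Executors.newScheduledThreadPool(1);
        AtomicInteger runs = new AtomicInteger();
        Runnable failing = () -> {
            runs.incrementAndGet();
            throw new ExceptionInInitializerError("simulated failure");
        };
        pool.schedule(new SelfReschedulingTask(pool, failing, 20), 0, TimeUnit.MILLISECONDS);
        try {
            Thread.sleep(500);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdownNow();
        }
        return runs.get();
    }

    public static void main(String[] args) {
        System.out.println("runs despite errors: " + runsDespiteErrors());
    }
}
```

The trade-off versus a watchdog thread is that nothing outside the task has to track expected run times, but the task only survives failures that happen inside run(), not a pool that stops executing work altogether.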
>> >>>>
>> >>>>
>> >>>>
>> >>>> [1] http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ScheduledThreadPoolExecutor.html#scheduleWithFixedDelay(java.lang.Runnable,%20long,%20long,%20java.util.concurrent.TimeUnit)
>> >>>> [2] http://grepcode.com/file/repo1.maven.org/maven2/org.apache.openjpa/openjpa-lib/2.2.2/org/apache/openjpa/lib/util/ConcreteClassGenerator.java?av=f#145
>> >>>> [3] http://docs.oracle.com/javase/7/docs/api/java/lang/reflect/Constructor.html#newInstance(java.lang.Object...)
>> >>>
>> >
>>
>
>
