It looks like OOZIE-1699 fixes the problem! Thanks for pointing that out.
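
For anyone finding this thread later, here is a minimal sketch of the ScheduledThreadPoolExecutor behavior discussed in the quoted messages. It uses only the JDK; the class and method names (ScheduledErrorDemo, guarded, runsWhenUnguarded, guardedReachesRuns) are made up for illustration and are not Oozie code. As far as I can tell, in a bare JDK executor any Throwable that escapes a periodic task stops later runs, so the Exception/Error difference seen in testing would presumably come from a wrapper that catches only Exception somewhere above the task. That last part is an assumption, not something checked against the Oozie source.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

class ScheduledErrorDemo {

    /** Hypothetical helper: keeps any Throwable (Exception or Error) away from the executor. */
    static Runnable guarded(Runnable task) {
        return () -> {
            try {
                task.run();
            } catch (Throwable t) {
                // Log and carry on, so the executor still schedules the next run.
                System.err.println("Scheduled task failed: " + t);
            }
        };
    }

    /** Schedules an unguarded task that throws an Error; returns how many times it ran. */
    static int runsWhenUnguarded() {
        ScheduledThreadPoolExecutor pool = new ScheduledThreadPoolExecutor(1);
        AtomicInteger runs = new AtomicInteger();
        try {
            ScheduledFuture<?> future = pool.scheduleWithFixedDelay(() -> {
                runs.incrementAndGet();
                throw new ExceptionInInitializerError("simulated database failure");
            }, 0, 10, TimeUnit.MILLISECONDS);
            try {
                future.get(5, TimeUnit.SECONDS); // returns only because the task threw
            } catch (ExecutionException expected) {
                // The Error surfaces here; the task is no longer scheduled.
            }
            Thread.sleep(100); // leave room for further runs, if there were any
            return runs.get();
        } catch (InterruptedException | TimeoutException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
    }

    /** Schedules the same failing task behind guarded(); true if it reaches the wanted run count. */
    static boolean guardedReachesRuns(int wanted) {
        ScheduledThreadPoolExecutor pool = new ScheduledThreadPoolExecutor(1);
        CountDownLatch latch = new CountDownLatch(wanted);
        try {
            pool.scheduleWithFixedDelay(guarded(() -> {
                latch.countDown();
                throw new ExceptionInInitializerError("simulated database failure");
            }), 0, 10, TimeUnit.MILLISECONDS);
            return latch.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("unguarded runs: " + runsWhenUnguarded());
        System.out.println("guarded reached 3 runs: " + guardedReachesRuns(3));
    }
}
```

Catching Throwable this broadly is normally discouraged (see the Stack Overflow link quoted further down), so a real fix would probably want to log loudly and perhaps rethrow truly fatal Errors such as OutOfMemoryError.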

On Mon, Feb 24, 2014 at 10:58 AM, Robert Kanter <[email protected]> wrote:

>> Do you see that a single job materialization works fine when there is an
>> Exception but never proceeds when the runnable throws an Error?
>
> Yes; though it did seem funny that it currently works fine for Exceptions
> but not Errors -- this is because of the issue Shwetha described/fixed in
> OOZIE-1699? I'll try checking if that patch fixes the issue we're seeing.
>
>> [me off at the moment, just seen this and wanted to quickly point out
>> something about java errors]
>> http://stackoverflow.com/questions/11017304/catching-java-errors
>> see accepted response
>
> Yup, I saw that. I don't think we need to worry about all types of
> Errors, just the specific one (ExceptionInInitializerError) that OpenJPA is
> throwing when the MySQL driver is throwing an Exception. In this case, we
> can recover from this Error by basically just ignoring it for now
> and "hoping" it works the next time the task is run. See
> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.openjpa/openjpa-lib/2.2.2/org/apache/openjpa/lib/util/ConcreteClassGenerator.java?av=f#149
>
> On Mon, Feb 24, 2014 at 10:49 AM, Alejandro Abdelnur <[email protected]> wrote:
>
>> [me off at the moment, just seen this and wanted to quickly point out
>> something about java errors]
>>
>> http://stackoverflow.com/questions/11017304/catching-java-errors
>>
>> see accepted response
>>
>> thx
>>
>> Alejandro
>> (phone typing)
>>
>>> On Feb 24, 2014, at 10:43, Virag Kothari <[email protected]> wrote:
>>>
>>> The coord materialization runnable always does the same task, which is
>>> bringing coord jobs eligible for materialization on Oozie server.
>>> So, it shouldn't matter if one of the scheduled tasks is lost due to an
>>> error or exception. If only one coord job stops materializing and not
>>> others, I am also thinking that you might be hitting OOZIE-1699. Do you
>>> see that a single job materialization works fine when there is an
>>> Exception but never proceeds when the runnable throws an Error?
>>>
>>> Thanks,
>>> Virag
>>>
>>>> On 2/24/14 10:16 AM, "Robert Kanter" <[email protected]> wrote:
>>>>
>>>> Shwetha, while that sounds similar to the issue I described, and may
>>>> have a similar solution, I don't think they're the same issue. Your
>>>> issue looks like it has to do with some logic Oozie is using in
>>>> CallableQueueService where it's not handling Exceptions properly in an
>>>> edge case. My issue has to do with how Java's ThreadPool handles
>>>> Runnables that throw an Error.
>>>>
>>>>> On Mon, Feb 24, 2014 at 1:39 AM, Shwetha GS <[email protected]> wrote:
>>>>>
>>>>> We had the same issue. We found that because of an edge case, the
>>>>> callable command (CoordMaterializeTransitionXCommand in this case,
>>>>> which does the actual materialisation) is added to the unique
>>>>> callable (in CallableQueueService), but is never added to the queue.
>>>>> Hence the materialisation doesn't happen. Since
>>>>> CoordMaterializeTransitionXCommand is already in unique callable,
>>>>> further recovery from CoordMaterializeTriggerService doesn't enqueue
>>>>> any more CoordMaterializeTransitionXCommand. The issue is being fixed
>>>>> as part of https://issues.apache.org/jira/browse/OOZIE-1699. This
>>>>> issue can happen with any command (not just materialisation).
>>>>>
>>>>> To verify if this is indeed the issue,
>>>>> 1. CoordMaterializeTriggerService should pick up this coord every 5
>>>>>    mins (materialisation interval)
>>>>> 2. CoordMaterializeTransitionXCommand never runs
>>>>> 3. oozie admin -queuedump lists that coord for mater in unique
>>>>>    callable
>>>>>
>>>>> On Sat, Feb 22, 2014 at 5:29 AM, Robert Kanter <[email protected]> wrote:
>>>>>
>>>>>> Sorry for the long message, but it's hard to explain the issue
>>>>>> succinctly.
>>>>>>
>>>>>> We recently saw a weird issue where Oozie suddenly stopped
>>>>>> materializing actions for a Coordinator job. All we saw in the logs
>>>>>> was this message:
>>>>>>
>>>>>> 2014-01-31 02:10:03,378 WARN org.apache.oozie.service.JPAService: USER[-] GROUP[-] TOKEN[-] APP[-] JOB[-] ACTION[-] JPAExecutor [CoordJobsToBeMaterializedJPAExecutor] ended with an active transaction, rolling back
>>>>>>
>>>>>> After a bunch of digging, I figured out what was happening:
>>>>>>
>>>>>> The CoordMaterializeTriggerService is responsible for materializing
>>>>>> Coord actions via a Runnable that gets scheduled every 5min via the
>>>>>> SchedulerService. The SchedulerService is backed by a
>>>>>> ScheduledThreadPoolExecutor. The Runnable does a bunch of things,
>>>>>> but the important thing is that it executes
>>>>>> CoordJobsToBeMaterializedJPAExecutor via the JPAService. In the
>>>>>> JPAService#execute(...) method, this happens:
>>>>>>
>>>>>>     try {
>>>>>>         if (em.getTransaction().isActive()) {
>>>>>>             LOG.warn("JPAExecutor [{0}] ended with an active transaction, rolling back", executor.getName());
>>>>>>             *em.getTransaction().rollback();*
>>>>>>         }
>>>>>>     }
>>>>>>     catch (Exception ex) {
>>>>>>         LOG.warn("Could not check/rollback transaction after JPAExecutor [{0}], {1}", executor.getName(), ex.getMessage(), ex);
>>>>>>     }
>>>>>>
>>>>>> I've determined that the bold line (the rollback) can throw an Error
>>>>>> (as in the Throwable, not a synonym for Exception). For obvious
>>>>>> reasons, we're not catching this Error anywhere. This causes the
>>>>>> Thread to exit; however, we don't "lose" any threads because the
>>>>>> ScheduledThreadPoolExecutor will simply start a new one. However,
>>>>>> the task that the thread was executing is dropped from the schedule.
>>>>>> The documentation [1] says "If any execution of the task encounters
>>>>>> an exception, subsequent executions are suppressed." I did some
>>>>>> testing, and that's not quite true. Exceptions don't cause
>>>>>> subsequent executions to be suppressed, but Errors do.
>>>>>>
>>>>>> I tried catching all Errors to figure out what was causing this and
>>>>>> reproduced the problem (or at least a similar problem by killing the
>>>>>> MySQL database) and got a stack trace like this (that I've shortened
>>>>>> for brevity):
>>>>>>
>>>>>> java.lang.ExceptionInInitializerError
>>>>>>     at org.apache.openjpa.lib.util.ConcreteClassGenerator.newInstance(ConcreteClassGenerator.java:132)
>>>>>>     at org.apache.openjpa.lib.jdbc.ConfiguringConnectionDecorator.decorate(ConfiguringConnectionDecorator.java:110)
>>>>>>     at org.apache.openjpa.lib.jdbc.DecoratingDataSource.decorate(DecoratingDataSource.java:118)
>>>>>>     at org.apache.openjpa.lib.jdbc.DecoratingDataSource.getConnection(DecoratingDataSource.java:107)
>>>>>>     ...
>>>>>> Caused by: java.lang.reflect.InvocationTargetException
>>>>>>     at sun.reflect.GeneratedConstructorAccessor26.newInstance(Unknown Source)
>>>>>>     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>>>>>>     at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>>>>>>     at org.apache.openjpa.lib.util.ConcreteClassGenerator.newInstance(ConcreteClassGenerator.java:130)
>>>>>>     ... 28 more
>>>>>> Caused by: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure
>>>>>>
>>>>>> As we can see here [2], ConcreteClassGenerator.newInstance throws an
>>>>>> ExceptionInInitializerError when
>>>>>> java.lang.reflect.Constructor.newInstance throws any Exception; in
>>>>>> this case, an InvocationTargetException, which it throws if the
>>>>>> underlying constructor in the reflection throws an Exception [3].
>>>>>> And the underlying constructor is throwing a CommunicationsException
>>>>>> from the MySQL JDBC driver.
>>>>>>
>>>>>> Long story short, in some cases scheduled tasks can throw an Error
>>>>>> instead of an Exception when there's a database-related problem. And
>>>>>> this causes them to not be rescheduled. It is also possible for this
>>>>>> to happen to other tasks being executed via the SchedulerService
>>>>>> besides the CoordMaterializeTriggerService, which would result in
>>>>>> other missing behaviors.
>>>>>>
>>>>>> I also noticed that even when I caught the Error, it still didn't
>>>>>> reschedule the task -- not sure why...
>>>>>>
>>>>>> Any thoughts or suggestions on the best way to recover from this
>>>>>> situation?
>>>>>>
>>>>>> Restarting Oozie works, but that's not ideal.
>>>>>> One idea I had was to add some kind of "watchdog" thread that would
>>>>>> somehow notice if a scheduled task hadn't executed in the proper
>>>>>> interval + some threshold and reschedule it.
>>>>>> My other, simpler, idea is to try to manually reschedule a new
>>>>>> identical task in the Error catch block, which I'll try soon.
>>>>>>
>>>>>> [1] http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ScheduledThreadPoolExecutor.html#scheduleWithFixedDelay(java.lang.Runnable,%20long,%20long,%20java.util.concurrent.TimeUnit)
>>>>>> [2] http://grepcode.com/file/repo1.maven.org/maven2/org.apache.openjpa/openjpa-lib/2.2.2/org/apache/openjpa/lib/util/ConcreteClassGenerator.java?av=f#145
>>>>>> [3] http://docs.oracle.com/javase/7/docs/api/java/lang/reflect/Constructor.html#newInstance(java.lang.Object...)
>>>>>
>>>>> --
>>>>> _____________________________________________________________
>>>>> The information contained in this communication is intended solely
>>>>> for the use of the individual or entity to whom it is addressed and
>>>>> others authorized to receive it. It may contain confidential or
>>>>> legally privileged information. If you are not the intended recipient
>>>>> you are hereby notified that any disclosure, copying, distribution or
>>>>> taking any action in reliance on the contents of this information is
>>>>> strictly prohibited and may be unlawful. If you have received this
>>>>> communication in error, please notify us immediately by responding to
>>>>> this email and then delete it from your system. The firm is neither
>>>>> liable for the proper and complete transmission of the information
>>>>> contained in this communication nor for any delay in its receipt.
