[I'm off at the moment; I just saw this and wanted to quickly point out something about Java errors.]
http://stackoverflow.com/questions/11017304/catching-java-errors
See the accepted response.

thx

Alejandro
(phone typing)

> On Feb 24, 2014, at 10:43, Virag Kothari <[email protected]> wrote:
>
> The coord materialization runnable always does the same task, which is
> bringing coord jobs eligible for materialization on the Oozie server.
> So it shouldn't matter if one of the scheduled tasks is lost due to an
> error or exception. If only one coord job stops materializing and not
> others, I am also thinking that you might be hitting OOZIE-1699. Do you
> see that a single job materialization works fine when there is an
> Exception but never proceeds when the runnable throws an Error?
>
> Thanks,
> Virag
>
>> On 2/24/14 10:16 AM, "Robert Kanter" <[email protected]> wrote:
>>
>> Shwetha, while that sounds similar to the issue I described, and may
>> have a similar solution, I don't think they're the same issue. Your
>> issue looks like it has to do with some logic Oozie is using in
>> CallableQueueService where it's not handling Exceptions properly in an
>> edge case. My issue has to do with how Java's ThreadPool handles
>> Runnables that throw an Error.
>>
>>> On Mon, Feb 24, 2014 at 1:39 AM, Shwetha GS <[email protected]> wrote:
>>>
>>> We had the same issue. We found that because of an edge case, the
>>> callable command (CoordMaterializeTransitionXCommand in this case, which
>>> does the actual materialisation) is added to the unique callable (in
>>> CallableQueueService), but is never added to the queue. Hence the
>>> materialisation doesn't happen. Since CoordMaterializeTransitionXCommand
>>> is already in unique callable, further recovery from
>>> CoordMaterializeTriggerService doesn't enqueue any more
>>> CoordMaterializeTransitionXCommand. The issue is being fixed as part of
>>> https://issues.apache.org/jira/browse/OOZIE-1699. This issue can happen
>>> with any command (not just materialisation).
>>>
>>> To verify if this is indeed the issue:
>>> 1. CoordMaterializeTriggerService should pick up this coord every 5
>>>    mins (materialisation interval)
>>> 2. CoordMaterializeTransitionXCommand never runs
>>> 3. oozie admin -queuedump lists that coord for mater in unique callable
>>>
>>> On Sat, Feb 22, 2014 at 5:29 AM, Robert Kanter <[email protected]> wrote:
>>>
>>>> Sorry for the long message, but it's hard to explain the issue
>>>> succinctly.
>>>>
>>>> We recently saw a weird issue where Oozie suddenly stopped materializing
>>>> actions for a Coordinator job. All we saw in the logs was this message:
>>>>
>>>> 2014-01-31 02:10:03,378 WARN org.apache.oozie.service.JPAService: USER[-] GROUP[-] TOKEN[-] APP[-] JOB[-] ACTION[-] JPAExecutor [CoordJobsToBeMaterializedJPAExecutor] ended with an active transaction, rolling back
>>>>
>>>> After a bunch of digging, I figured out what was happening:
>>>>
>>>> The CoordMaterializeTriggerService is responsible for materializing
>>>> Coord actions via a Runnable that gets scheduled every 5min via the
>>>> SchedulerService. The SchedulerService is backed by a
>>>> ScheduledThreadPoolExecutor. The Runnable does a bunch of things, but
>>>> the important thing is that it executes
>>>> CoordJobsToBeMaterializedJPAExecutor via the JPAService. In the
>>>> JPAService#execute(...) method, this happens:
>>>>
>>>>     try {
>>>>         if (em.getTransaction().isActive()) {
>>>>             LOG.warn("JPAExecutor [{0}] ended with an active transaction, rolling back", executor.getName());
>>>>             *em.getTransaction().rollback();*
>>>>         }
>>>>     }
>>>>     catch (Exception ex) {
>>>>         LOG.warn("Could not check/rollback transaction after JPAExecutor [{0}], {1}", executor.getName(), ex.getMessage(), ex);
>>>>     }
>>>>
>>>> I've determined that the bold line (the rollback) can throw an Error (as
>>>> in the Throwable, not a synonym for Exception).
>>>> For obvious reasons, we're not catching this Error anywhere. This causes
>>>> the Thread to exit; however, we don't "lose" any threads because the
>>>> ScheduledThreadPoolExecutor will simply start a new one. However, the
>>>> task that the thread was executing is dropped from the schedule. The
>>>> documentation [1] says "If any execution of the task encounters an
>>>> exception, subsequent executions are suppressed." I did some testing,
>>>> and that's not quite true. Exceptions don't cause subsequent executions
>>>> to be suppressed, but Errors do.
>>>>
>>>> I tried catching all Errors to figure out what was causing this and
>>>> reproduced the problem (or at least a similar problem, by killing the
>>>> MySQL database) and got a stack trace like this (which I've shortened
>>>> for brevity):
>>>>
>>>> java.lang.ExceptionInInitializerError
>>>>     at org.apache.openjpa.lib.util.ConcreteClassGenerator.newInstance(ConcreteClassGenerator.java:132)
>>>>     at org.apache.openjpa.lib.jdbc.ConfiguringConnectionDecorator.decorate(ConfiguringConnectionDecorator.java:110)
>>>>     at org.apache.openjpa.lib.jdbc.DecoratingDataSource.decorate(DecoratingDataSource.java:118)
>>>>     at org.apache.openjpa.lib.jdbc.DecoratingDataSource.getConnection(DecoratingDataSource.java:107)
>>>>     ...
>>>> Caused by: java.lang.reflect.InvocationTargetException
>>>>     at sun.reflect.GeneratedConstructorAccessor26.newInstance(Unknown Source)
>>>>     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>>>>     at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>>>>     at org.apache.openjpa.lib.util.ConcreteClassGenerator.newInstance(ConcreteClassGenerator.java:130)
>>>>     ... 28 more
>>>> Caused by: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure
>>>>
>>>> As we can see here [2], ConcreteClassGenerator.newInstance throws an
>>>> ExceptionInInitializerError when
>>>> java.lang.reflect.Constructor.newInstance throws any Exception; in this
>>>> case, an InvocationTargetException, which it throws if the underlying
>>>> constructor in the reflection throws an Exception [3]. And the
>>>> underlying constructor is throwing a CommunicationsException from the
>>>> MySQL JDBC driver.
>>>>
>>>> Long story short, in some cases scheduled tasks can throw an Error
>>>> instead of an Exception when there's a database-related problem. And
>>>> this causes them to not be rescheduled. It is also possible for this to
>>>> happen to other tasks being executed via the SchedulerService besides
>>>> the CoordMaterializeTriggerService, which would result in other missing
>>>> behaviors.
>>>>
>>>> I also noticed that even when I caught the Error, it still didn't
>>>> reschedule the task -- not sure why...
>>>>
>>>> Any thoughts or suggestions on the best way to recover from this
>>>> situation?
>>>>
>>>> Restarting Oozie works, but that's not ideal.
>>>> One idea I had was to add some kind of "watchdog" thread that would
>>>> somehow notice if a scheduled task hadn't executed in the proper
>>>> interval + some threshold and reschedule it.
>>>> My other, simpler, idea is to try to manually reschedule a new identical
>>>> task in the Error catch block, which I'll try soon.
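
A minimal Java sketch of the failure mode and of the "catch it inside the task" idea discussed above. This is not Oozie code: GuardedTaskDemo, guarded and countRuns are hypothetical names, and the failing task only simulates the database problem by throwing an ExceptionInInitializerError. It assumes a plain JDK ScheduledExecutorService; there, anything that escapes run() (an Error or an unchecked Exception) should suppress later runs, so wrapping the whole task body in catch (Throwable) is what keeps it on the schedule.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class GuardedTaskDemo {

    /** Wraps a task so that nothing it throws can escape into the scheduler. */
    static Runnable guarded(Runnable task) {
        return () -> {
            try {
                task.run();
            } catch (Throwable t) { // Throwable covers Error as well as Exception
                System.err.println("Scheduled task failed, staying scheduled: " + t);
            }
        };
    }

    /** Schedules a task that always throws an Error; returns how often it ran. */
    static int countRuns(boolean useGuard) {
        AtomicInteger runs = new AtomicInteger();
        Runnable failing = () -> {
            runs.incrementAndGet();
            // Simulated stand-in for the Error seen when the database is down.
            throw new ExceptionInInitializerError("simulated database failure");
        };
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);
        try {
            scheduler.scheduleWithFixedDelay(useGuard ? guarded(failing) : failing,
                    0, 50, TimeUnit.MILLISECONDS);
            Thread.sleep(300); // leave room for several executions
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            scheduler.shutdownNow();
        }
        return runs.get();
    }

    public static void main(String[] args) {
        System.out.println("unguarded runs: " + countRuns(false));
        System.out.println("guarded runs:   " + countRuns(true));
    }
}
```

The unguarded task runs once and is then dropped without any log output, which matches the silent stop described above; the guarded task keeps running until shutdown. The guard has to sit inside run() so the Throwable never reaches the executor.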
>>>>
>>>> [1] http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ScheduledThreadPoolExecutor.html#scheduleWithFixedDelay(java.lang.Runnable,%20long,%20long,%20java.util.concurrent.TimeUnit)
>>>> [2] http://grepcode.com/file/repo1.maven.org/maven2/org.apache.openjpa/openjpa-lib/2.2.2/org/apache/openjpa/lib/util/ConcreteClassGenerator.java?av=f#145
>>>> [3] http://docs.oracle.com/javase/7/docs/api/java/lang/reflect/Constructor.html#newInstance(java.lang.Object...)
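
To make the Error-versus-Exception chain from the stack trace concrete, here is a small self-contained sketch. FailingConnection, newInstance and classify are hypothetical stand-ins, not OpenJPA or Oozie code, and an IllegalStateException plays the role of the MySQL CommunicationsException. The sketch only assumes what [2] and [3] describe: a constructor invoked via reflection fails, the resulting InvocationTargetException is rethrown wrapped in an ExceptionInInitializerError, and so a catch (Exception) block like the one in JPAService#execute(...) never sees it.

```java
class ErrorChainDemo {

    /** Hypothetical stand-in for a class whose constructor fails on a dead database. */
    static class FailingConnection {
        public FailingConnection() {
            throw new IllegalStateException("Communications link failure (simulated)");
        }
    }

    /** Mirrors the described pattern: any reflection failure is rethrown as an Error. */
    static Object newInstance() {
        try {
            return FailingConnection.class.getDeclaredConstructor().newInstance();
        } catch (Exception e) {
            // e is an InvocationTargetException wrapping the constructor's exception.
            throw new ExceptionInInitializerError(e);
        }
    }

    /** Reports which catch block ends up handling the failure. */
    static String classify() {
        try {
            newInstance();
            return "no failure";
        } catch (Exception e) {
            return "caught as Exception";
        } catch (Error e) {
            // Walk the cause chain down to the original failure.
            Throwable root = e;
            while (root.getCause() != null) {
                root = root.getCause();
            }
            return "caught as Error, root cause: " + root.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(classify());
    }
}
```

The catch (Exception) branch is skipped because ExceptionInInitializerError extends Error, not Exception; only a catch for Error (or Throwable) handles it.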
