[ https://issues.apache.org/jira/browse/SAMZA-2491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bharath Kumarasubramanian updated SAMZA-2491:
---------------------------------------------
    Fix Version/s: 1.5

> AM should log uncaught exceptions and System.exit to ensure that the process dies on errors
> -------------------------------------------------------------------------------------------
>
>                 Key: SAMZA-2491
>                 URL: https://issues.apache.org/jira/browse/SAMZA-2491
>             Project: Samza
>          Issue Type: Improvement
>            Reporter: Hai Lu
>            Assignee: Hai Lu
>            Priority: Major
>             Fix For: 1.5
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> From: pmaheshw
> Symptom: A job deployment timed out waiting for application attempt to
> transition from New to Running.
> Cause: ClusterBasedJobCoordinator threw an exception during startup due to a
> misconfiguration, but did not kill the AM process (likely due to non-daemon
> threads).
> Suggested fixes:
> 1. ClusterBasedJobCoordinator#main doesn't use an uncaught exception handler,
> and doesn't catch + log any exceptions thrown from ClusterBasedJobCoordinator
> constructor or from run(). We should fix this. Uncaught exceptions go to
> stderr instead of logs and do not have a timestamp, which makes debugging
> difficult.
> E.g.:
> Exception in thread "main" org.apache.samza.SamzaException: Cannot get systemAdmin for system aggregate-tracking
>     at org.apache.samza.system.SystemAdmins.getSystemAdmin(SystemAdmins.java:63)
>     at org.apache.samza.system.StreamMetadataCache$$anonfun$3.apply(StreamMetadataCache.scala:66)
>     at org.apache.samza.system.StreamMetadataCache$$anonfun$3.apply(StreamMetadataCache.scala:64)
>     at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>     at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>     at scala.collection.immutable.Map$Map2.foreach(Map.scala:137)
>     at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>     at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
>     at org.apache.samza.system.StreamMetadataCache.getStreamMetadata(StreamMetadataCache.scala:64)
>     at org.apache.samza.coordinator.StreamPartitionCountMonitor.getMetadata(StreamPartitionCountMonitor.java:92)
>     at org.apache.samza.coordinator.StreamPartitionCountMonitor.<init>(StreamPartitionCountMonitor.java:113)
>     at org.apache.samza.clustermanager.ClusterBasedJobCoordinator.getPartitionCountMonitor(ClusterBasedJobCoordinator.java:343)
>     at org.apache.samza.clustermanager.ClusterBasedJobCoordinator.<init>(ClusterBasedJobCoordinator.java:207)
>     at org.apache.samza.clustermanager.ClusterBasedJobCoordinator.main(ClusterBasedJobCoordinator.java:441)
> 2. JC should call System.exit on returning from main (cleanly or on
> exception) and from the uncaught exception handler to ensure that the AM
> process dies on these errors and does not leave the deployment hanging. We've
> also seen this issue due to client libraries (datavault, brooklin, kafka
> etc.) creating non-daemon threads and not stopping them cleanly. See
> LocalContainerRunner for reference, which does kill the process on returning
> from main thread.
> E.g., in this case it's threads like this:
> "AsyncHttpClient-27-1" #134 prio=5 os_prio=0 tid=0x00007faead675000 nid=0x4151 runnable [0x00007fae9c9da000]
>    java.lang.Thread.State: RUNNABLE
>     at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>     at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
>     at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
>     at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
>     - locked <0x00000000fe6a2f40> (a com.linkedin.mario.shaded.io.netty.channel.nio.SelectedSelectionKeySet)
>     - locked <0x00000000fe6fe9c0> (a java.util.Collections$UnmodifiableSet)
>     - locked <0x00000000fe6a3f68> (a sun.nio.ch.EPollSelectorImpl)
>     at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
>     at com.linkedin.mario.shaded.io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:62)
>     at com.linkedin.mario.shaded.io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:824)
>     at com.linkedin.mario.shaded.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:457)
>     at com.linkedin.mario.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1044)
>     at com.linkedin.mario.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>     at com.linkedin.mario.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>     at java.lang.Thread.run(Thread.java:748)

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
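
For illustration, the two suggested fixes in the issue description (log failures through the logger, then force the process to exit) could be sketched roughly as below. This is not the actual Samza patch. The class name CoordinatorMainSketch, the helper methods, the use of java.util.logging, and the simulated failure are all assumptions made for the example; the real ClusterBasedJobCoordinator#main would construct and run the coordinator where the placeholder lambda is.

```java
import java.util.function.IntConsumer;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch; names here are illustrative and not part of Samza.
public class CoordinatorMainSketch {
  private static final Logger LOG =
      Logger.getLogger(CoordinatorMainSketch.class.getName());

  /**
   * Runs the coordinator body and maps the outcome to an exit code:
   * 0 on a clean return, 1 if anything is thrown. The failure is logged
   * through the logger so it carries a timestamp, unlike raw stderr output.
   */
  public static int runCoordinator(Runnable coordinator) {
    try {
      coordinator.run();
      return 0;
    } catch (Throwable t) {
      LOG.log(Level.SEVERE, "Coordinator failed; exiting", t);
      return 1;
    }
  }

  /**
   * Installs a default handler for exceptions that escape any thread.
   * The exit action is passed in so it can be replaced in tests;
   * main() passes System::exit.
   */
  public static void installUncaughtExceptionHandler(IntConsumer exit) {
    Thread.setDefaultUncaughtExceptionHandler((thread, error) -> {
      LOG.log(Level.SEVERE,
          "Uncaught exception in thread " + thread.getName(), error);
      exit.accept(1);
    });
  }

  public static void main(String[] args) {
    installUncaughtExceptionHandler(System::exit);
    int exitCode = runCoordinator(() -> {
      // Placeholder for constructing and running the real coordinator.
      throw new IllegalStateException("simulated misconfiguration");
    });
    // Exit explicitly on both paths so that lingering non-daemon threads
    // (e.g. from client libraries) cannot keep the JVM alive.
    System.exit(exitCode);
  }
}
```

Returning an exit code from runCoordinator and calling System.exit only in main keeps the failure path testable without terminating the JVM that runs the test.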