[ 
https://issues.apache.org/jira/browse/SAMZA-2491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bharath Kumarasubramanian updated SAMZA-2491:
---------------------------------------------
    Fix Version/s: 1.5

> AM should log uncaught exceptions and System.exit to ensure that the process 
> dies on errors
> -------------------------------------------------------------------------------------------
>
>                 Key: SAMZA-2491
>                 URL: https://issues.apache.org/jira/browse/SAMZA-2491
>             Project: Samza
>          Issue Type: Improvement
>            Reporter: Hai Lu
>            Assignee: Hai Lu
>            Priority: Major
>             Fix For: 1.5
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> From: pmaheshw
> Symptom: A job deployment timed out waiting for the application attempt to 
> transition from New to Running.
> Cause: ClusterBasedJobCoordinator threw an exception during startup due to a 
> misconfiguration, but the AM process was not killed (likely because 
> non-daemon threads kept the JVM alive).
> Suggested fixes:
> 1. ClusterBasedJobCoordinator#main does not install an uncaught exception 
> handler, and does not catch and log exceptions thrown from the 
> ClusterBasedJobCoordinator constructor or from run(). We should fix this: 
> uncaught exceptions go to stderr instead of the logs and carry no timestamp, 
> which makes debugging difficult. E.g.:
> Exception in thread "main" org.apache.samza.SamzaException: Cannot get systemAdmin for system aggregate-tracking
> at org.apache.samza.system.SystemAdmins.getSystemAdmin(SystemAdmins.java:63)
> at org.apache.samza.system.StreamMetadataCache$$anonfun$3.apply(StreamMetadataCache.scala:66)
> at org.apache.samza.system.StreamMetadataCache$$anonfun$3.apply(StreamMetadataCache.scala:64)
> at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> at scala.collection.immutable.Map$Map2.foreach(Map.scala:137)
> at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
> at org.apache.samza.system.StreamMetadataCache.getStreamMetadata(StreamMetadataCache.scala:64)
> at org.apache.samza.coordinator.StreamPartitionCountMonitor.getMetadata(StreamPartitionCountMonitor.java:92)
> at org.apache.samza.coordinator.StreamPartitionCountMonitor.<init>(StreamPartitionCountMonitor.java:113)
> at org.apache.samza.clustermanager.ClusterBasedJobCoordinator.getPartitionCountMonitor(ClusterBasedJobCoordinator.java:343)
> at org.apache.samza.clustermanager.ClusterBasedJobCoordinator.<init>(ClusterBasedJobCoordinator.java:207)
> at org.apache.samza.clustermanager.ClusterBasedJobCoordinator.main(ClusterBasedJobCoordinator.java:441)
> 2. The JC should call System.exit on returning from main (cleanly or on 
> exception) and from the uncaught exception handler, to ensure that the AM 
> process dies on these errors and does not leave the deployment hanging. We've 
> also seen this issue when client libraries (datavault, brooklin, kafka, 
> etc.) create non-daemon threads and do not stop them cleanly. See 
> LocalContainerRunner for reference, which does kill the process on returning 
> from the main thread. E.g., in this case it's threads like this:
> "AsyncHttpClient-27-1" #134 prio=5 os_prio=0 tid=0x00007faead675000 nid=0x4151 runnable [0x00007fae9c9da000]
> java.lang.Thread.State: RUNNABLE
> at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
> at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
> at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
> at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
>  - locked <0x00000000fe6a2f40> (a com.linkedin.mario.shaded.io.netty.channel.nio.SelectedSelectionKeySet)
>  - locked <0x00000000fe6fe9c0> (a java.util.Collections$UnmodifiableSet)
>  - locked <0x00000000fe6a3f68> (a sun.nio.ch.EPollSelectorImpl)
> at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
> at com.linkedin.mario.shaded.io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:62)
> at com.linkedin.mario.shaded.io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:824)
> at com.linkedin.mario.shaded.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:457)
> at com.linkedin.mario.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1044)
> at com.linkedin.mario.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
> at com.linkedin.mario.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> at java.lang.Thread.run(Thread.java:748)
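
The two suggested fixes in the description can be sketched together roughly as follows. This is a minimal illustration under stated assumptions, not the actual ClusterBasedJobCoordinator change: the class name ExitOnFailureLauncher, its method names, and the exit codes 0 and 1 are hypothetical, and java.util.logging stands in for whatever logging framework the job coordinator really uses so that the example is self-contained.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical launcher used for illustration; not the actual Samza code. */
final class ExitOnFailureLauncher {
  private static final Logger LOG =
      Logger.getLogger(ExitOnFailureLauncher.class.getName());

  /**
   * Runs the body on the calling thread. Any Throwable is written through the
   * logger (so it gets a timestamp) instead of going to raw stderr, and is
   * mapped to a non-zero exit code. Exit codes 0 and 1 are an assumption.
   */
  static int runAndGetExitCode(Runnable body) {
    try {
      body.run();
      return 0;
    } catch (Throwable t) {
      LOG.log(Level.SEVERE, "Exiting due to exception in main thread", t);
      return 1;
    }
  }

  /** Logs uncaught exceptions from any other thread, then kills the JVM. */
  static void installExitingUncaughtExceptionHandler() {
    Thread.setDefaultUncaughtExceptionHandler((thread, throwable) -> {
      LOG.log(Level.SEVERE,
          "Uncaught exception in thread " + thread.getName(), throwable);
      System.exit(1);
    });
  }

  public static void main(String[] args) {
    installExitingUncaughtExceptionHandler();

    // Stand-in for a client library that leaves a non-daemon thread running.
    Thread leaked = new Thread(() -> {
      try {
        Thread.sleep(Long.MAX_VALUE);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }, "leaked-non-daemon");
    leaked.start();

    // Stand-in for constructing and running the job coordinator.
    int exitCode = runAndGetExitCode(() -> {
      throw new IllegalStateException("simulated misconfiguration");
    });

    // Explicit exit: without it the leaked thread would keep the JVM alive.
    System.exit(exitCode);
  }
}
```

In this sketch the explicit System.exit at the end of main is what lets the process terminate while the leaked thread is still sleeping; removing that call would leave the JVM running after main returns, which matches the hang described in the symptom.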



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
