[ https://issues.apache.org/jira/browse/FLINK-819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304955#comment-14304955 ]

Till Rohrmann commented on FLINK-819:
-------------------------------------

Should be tested.

> OutOfMemoryError from TaskManager is causing hard to understand exceptions and blocking JobManager
> --------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-819
>                 URL: https://issues.apache.org/jira/browse/FLINK-819
>             Project: Flink
>          Issue Type: Improvement
>          Components: Distributed Runtime
>            Reporter: GitHub Import
>              Labels: github-import
>             Fix For: pre-apache
>
>
> While doing some pre 0.5 release testing, I saw this exception twice in the
> JobManager's log.
> It occurred during the setup of a new job.
> (Here is the full log: https://gist.github.com/rmetzger/1a9ed4080eedb4e0c8f1)
> It also seems that the task cancellation of the job does not work. The
> JobManager has not printed any output for more than 15 minutes now. But I
> think this is a known issue. Pressing "Cancel" does work, also in this
> situation.
> ```
> 13:42:00,512 ERROR eu.stratosphere.nephele.jobmanager.JobManager - Cannot check library availability: java.io.IOException: Call to /192.168.7.12:38350 failed on local exception: java.io.EOFException
>     at eu.stratosphere.nephele.ipc.Client.wrapException(Client.java:737)
>     at eu.stratosphere.nephele.ipc.Client.call(Client.java:706)
>     at eu.stratosphere.nephele.ipc.RPC$Invoker.invoke(RPC.java:250)
>     at com.sun.proxy.$Proxy13.updateLibraryCache(Unknown Source)
>     at eu.stratosphere.nephele.instance.AbstractInstance.checkLibraryAvailability(AbstractInstance.java:174)
>     at eu.stratosphere.nephele.jobmanager.JobManager$7.run(JobManager.java:1094)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:701)
> Caused by: java.io.EOFException
>     at java.io.DataInputStream.readInt(DataInputStream.java:392)
>     at eu.stratosphere.nephele.ipc.Client$Connection.receiveResponse(Client.java:497)
>     at eu.stratosphere.nephele.ipc.Client$Connection.run(Client.java:443)
> ```
> Machine 192.168.7.12 logs the following exception locally:
> ```
> 13:40:53,120 INFO  eu.stratosphere.nephele.execution.ExecutionStateTransition - TM: ExecutionState set from FINISHING to FINISHED for task Reduce(<Unnamed Reducer>) (8/8)
> 13:42:00,118 WARN  eu.stratosphere.nephele.ipc.Server - Out of Memory in server select
> java.lang.OutOfMemoryError: Java heap space
>     at eu.stratosphere.nephele.execution.librarycache.LibraryCacheManager.readLibraryFromStreamInternal(LibraryCacheManager.java:582)
>     at eu.stratosphere.nephele.execution.librarycache.LibraryCacheManager.readLibraryFromStream(LibraryCacheManager.java:556)
>     at eu.stratosphere.nephele.execution.librarycache.LibraryCacheUpdate.read(LibraryCacheUpdate.java:53)
>     at eu.stratosphere.nephele.ipc.RPC$Invocation.read(RPC.java:136)
>     at eu.stratosphere.nephele.ipc.Server$Connection.processData(Server.java:897)
>     at eu.stratosphere.nephele.ipc.Server$Connection.readAndProcess(Server.java:858)
>     at eu.stratosphere.nephele.ipc.Server$Listener.doRead(Server.java:450)
>     at eu.stratosphere.nephele.ipc.Server$Listener.run(Server.java:353)
> 13:43:00,180 INFO  eu.stratosphere.nephele.ipc.Server - IPC Server listener on 38350: readAndProcess threw exception java.io.IOException: dd5ba55f25851ac2f9f0d53971ed92f70cee2afc.jar does not exist in the library cache. Count of bytes read: 0
> java.io.IOException: dd5ba55f25851ac2f9f0d53971ed92f70cee2afc.jar does not exist in the library cache
>     at eu.stratosphere.nephele.execution.librarycache.LibraryCacheManager.registerInternal(LibraryCacheManager.java:316)
>     at eu.stratosphere.nephele.execution.librarycache.LibraryCacheManager.register(LibraryCacheManager.java:277)
>     at eu.stratosphere.nephele.deployment.TaskDeploymentDescriptor.read(TaskDeploymentDescriptor.java:240)
>     at eu.stratosphere.nephele.util.SerializableArrayList.read(SerializableArrayList.java:100)
>     at eu.stratosphere.nephele.ipc.RPC$Invocation.read(RPC.java:136)
>     at eu.stratosphere.nephele.ipc.Server$Connection.processData(Server.java:897)
>     at eu.stratosphere.nephele.ipc.Server$Connection.readAndProcess(Server.java:858)
>     at eu.stratosphere.nephele.ipc.Server$Listener.doRead(Server.java:450)
>     at eu.stratosphere.nephele.ipc.Server$Listener.run(Server.java:353)
> ```
> How can we improve the user experience here?
> (I have to admit, the TaskManager has only 512 MB of heap space.)
>
> ---------------- Imported from GitHub ----------------
> Url: https://github.com/stratosphere/stratosphere/issues/819
> Created by: [rmetzger|https://github.com/rmetzger]
> Labels: bug, question, runtime
> Created at: Thu May 15 16:04:17 CEST 2014
> State: open

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)