[ https://issues.apache.org/jira/browse/FLINK-819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304955#comment-14304955 ]

Till Rohrmann commented on FLINK-819:
-------------------------------------

Should be tested.

> OutOfMemoryError from TaskManager is causing hard to understand exceptions 
> and blocking JobManager
> --------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-819
>                 URL: https://issues.apache.org/jira/browse/FLINK-819
>             Project: Flink
>          Issue Type: Improvement
>          Components: Distributed Runtime
>            Reporter: GitHub Import
>              Labels: github-import
>             Fix For: pre-apache
>
>
> While doing some pre-0.5 release testing, I saw this exception twice in the
> JobManager's log.
> It occurred during the setup of a new job.
> (Here is the full log: https://gist.github.com/rmetzger/1a9ed4080eedb4e0c8f1)
> It also seems that the job's task cancellation does not work: the
> JobManager has not printed any output for more than 15 minutes. But I
> think this is a known issue. Pressing "Cancel" does work, even in this
> situation.
> ```
> 13:42:00,512 ERROR eu.stratosphere.nephele.jobmanager.JobManager              
>    - Cannot check library availability: java.io.IOException: Call to 
> /192.168.7.12:38350 failed on local exception: java.io.EOFException
>       at eu.stratosphere.nephele.ipc.Client.wrapException(Client.java:737)
>       at eu.stratosphere.nephele.ipc.Client.call(Client.java:706)
>       at eu.stratosphere.nephele.ipc.RPC$Invoker.invoke(RPC.java:250)
>       at com.sun.proxy.$Proxy13.updateLibraryCache(Unknown Source)
>       at 
> eu.stratosphere.nephele.instance.AbstractInstance.checkLibraryAvailability(AbstractInstance.java:174)
>       at 
> eu.stratosphere.nephele.jobmanager.JobManager$7.run(JobManager.java:1094)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>       at java.lang.Thread.run(Thread.java:701)
> Caused by: java.io.EOFException
>       at java.io.DataInputStream.readInt(DataInputStream.java:392)
>       at 
> eu.stratosphere.nephele.ipc.Client$Connection.receiveResponse(Client.java:497)
>       at eu.stratosphere.nephele.ipc.Client$Connection.run(Client.java:443)
> ```
> Machine 192.168.7.12 logged the following exception locally:
> ```
> 13:40:53,120 INFO  eu.stratosphere.nephele.execution.ExecutionStateTransition 
>    - TM: ExecutionState set from FINISHING to FINISHED for task 
> Reduce(<Unnamed Reducer>) (8/8)
> 13:42:00,118 WARN  eu.stratosphere.nephele.ipc.Server                         
>    - Out of Memory in server select
> java.lang.OutOfMemoryError: Java heap space
>         at 
> eu.stratosphere.nephele.execution.librarycache.LibraryCacheManager.readLibraryFromStreamInternal(LibraryCacheManager.java:582)
>         at 
> eu.stratosphere.nephele.execution.librarycache.LibraryCacheManager.readLibraryFromStream(LibraryCacheManager.java:556)
>         at 
> eu.stratosphere.nephele.execution.librarycache.LibraryCacheUpdate.read(LibraryCacheUpdate.java:53)
>         at eu.stratosphere.nephele.ipc.RPC$Invocation.read(RPC.java:136)
>         at 
> eu.stratosphere.nephele.ipc.Server$Connection.processData(Server.java:897)
>         at 
> eu.stratosphere.nephele.ipc.Server$Connection.readAndProcess(Server.java:858)
>         at eu.stratosphere.nephele.ipc.Server$Listener.doRead(Server.java:450)
>         at eu.stratosphere.nephele.ipc.Server$Listener.run(Server.java:353)
> 13:43:00,180 INFO  eu.stratosphere.nephele.ipc.Server                         
>    - IPC Server listener on 38350: readAndProcess threw exception 
> java.io.IOException: dd5ba55f25851ac2f9f0d53971ed92f70cee2afc.jar does not exist in the library cache. Count of bytes read: 0
> java.io.IOException: dd5ba55f25851ac2f9f0d53971ed92f70cee2afc.jar does not exist in the library cache
>         at 
> eu.stratosphere.nephele.execution.librarycache.LibraryCacheManager.registerInternal(LibraryCacheManager.java:316)
>         at 
> eu.stratosphere.nephele.execution.librarycache.LibraryCacheManager.register(LibraryCacheManager.java:277)
>         at 
> eu.stratosphere.nephele.deployment.TaskDeploymentDescriptor.read(TaskDeploymentDescriptor.java:240)
>         at 
> eu.stratosphere.nephele.util.SerializableArrayList.read(SerializableArrayList.java:100)
>         at eu.stratosphere.nephele.ipc.RPC$Invocation.read(RPC.java:136)
>         at 
> eu.stratosphere.nephele.ipc.Server$Connection.processData(Server.java:897)
>         at 
> eu.stratosphere.nephele.ipc.Server$Connection.readAndProcess(Server.java:858)
>         at eu.stratosphere.nephele.ipc.Server$Listener.doRead(Server.java:450)
>         at eu.stratosphere.nephele.ipc.Server$Listener.run(Server.java:353)
> ```
> How can we improve the user experience here?
> (I have to admit that the TaskManager has only 512 MB of heap space.)
> ---------------- Imported from GitHub ----------------
> Url: https://github.com/stratosphere/stratosphere/issues/819
> Created by: [rmetzger|https://github.com/rmetzger]
> Labels: bug, question, runtime
> Created at: Thu May 15 16:04:17 CEST 2014
> State: open
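A minimal sketch of one way to get a clearer error in this situation. It assumes the library arrives as a 4-byte length prefix followed by the jar bytes; the stack trace above only shows `LibraryCacheManager.readLibraryFromStreamInternal` running out of heap while reading, not the actual wire format. The class `LibraryReadGuard`, its methods, and the heap-budget parameter are hypothetical names for illustration, not Stratosphere or Flink APIs. The idea is to compare the announced size against a heap budget before allocating, so that a 512 MB TaskManager reports an `IOException` naming the library instead of failing inside the IPC listener.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical sketch, not Stratosphere/Flink code: read a length-prefixed
// library blob, but check the announced size against a heap budget first so the
// failure can be reported with a descriptive message instead of an OutOfMemoryError.
final class LibraryReadGuard {

    // Rough estimate of heap that can still be allocated (max heap minus heap in use).
    static long availableHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        return rt.maxMemory() - used;
    }

    // Reads an int length followed by that many bytes. In a real server the budget
    // might be availableHeapBytes() minus some headroom (an assumption).
    // If this throws after readInt(), the payload is still on the wire, so the caller
    // should close the connection rather than keep reading from it.
    static byte[] readLibrary(DataInputStream in, String name, long heapBudgetBytes) throws IOException {
        int length = in.readInt();
        if (length < 0) {
            throw new IOException("Corrupt length " + length + " announced for library " + name);
        }
        if (length > heapBudgetBytes) {
            throw new IOException("Cannot receive library " + name + ": it needs " + length
                    + " bytes but the heap budget is only " + heapBudgetBytes
                    + " bytes. Consider increasing the TaskManager heap size.");
        }
        byte[] buf;
        try {
            buf = new byte[length];
        } catch (OutOfMemoryError e) {
            // Last resort: the budget is only an estimate, so translate a failed allocation.
            throw new IOException("Out of heap while allocating " + length + " bytes for library " + name, e);
        }
        in.readFully(buf);
        return buf;
    }

    // Builds a length-prefixed frame, standing in for what the sender would write.
    static byte[] frame(byte[] payload) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeInt(payload.length);
        out.write(payload);
        out.flush();
        return bos.toByteArray();
    }

    static DataInputStream stream(byte[] bytes) {
        return new DataInputStream(new ByteArrayInputStream(bytes));
    }

    public static void main(String[] args) throws IOException {
        byte[] small = readLibrary(stream(frame(new byte[] {1, 2, 3})), "small.jar", 1024);
        System.out.println("read " + small.length + " bytes"); // prints "read 3 bytes"
        try {
            readLibrary(stream(frame(new byte[2048])), "big.jar", 1024);
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The pre-allocation check is the main point; catching `OutOfMemoryError` around a single large allocation is only a fallback, since catching it in general is risky. Whether the resulting `IOException` actually reaches the JobManager (rather than another `EOFException`) depends on how the IPC server reports failed calls, which this sketch does not model.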



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
