[ 
https://issues.apache.org/jira/browse/FLINK-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ufuk Celebi closed FLINK-2091.
------------------------------
    Resolution: Not A Problem

The core issue was not lock contention but too many tasks running at the 
same time. I think the tests that provoked this issue were running around a 
thousand tasks on 8-slot task managers.

We can think about improving the registration/unregistration logic, but I don't 
think that it is a problem at the moment (especially with regard to the 
release).

> Lock contention during release of network buffer pools
> ------------------------------------------------------
>
>                 Key: FLINK-2091
>                 URL: https://issues.apache.org/jira/browse/FLINK-2091
>             Project: Flink
>          Issue Type: Improvement
>          Components: Distributed Runtime
>    Affects Versions: master
>            Reporter: Ufuk Celebi
>            Assignee: Ufuk Celebi
>
> [~rmetzger] reported the following stack traces during cancelling of high 
> parallelism jobs:
> {code}
> 13:43:46,803 WARN  org.apache.flink.runtime.taskmanager.Task                  
>    - Task 'DataSource (at main(Job.java:59) 
> (org.apache.flink.api.java.io.TextInputFormat)) (4/16)' did not react to 
> cancelling signal, but is stuck in method:
>  
> org.apache.flink.runtime.io.network.buffer.LocalBufferPool.setNumBuffers(LocalBufferPool.java:238)
> org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.redistributeBuffers(NetworkBufferPool.java:268)
> org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.destroyBufferPool(NetworkBufferPool.java:218)
> org.apache.flink.runtime.io.network.buffer.LocalBufferPool.lazyDestroy(LocalBufferPool.java:221)
> org.apache.flink.runtime.io.network.partition.ResultPartition.destroyBufferPool(ResultPartition.java:302)
> org.apache.flink.runtime.io.network.NetworkEnvironment.unregisterTask(NetworkEnvironment.java:366)
> org.apache.flink.runtime.taskmanager.Task.run(Task.java:647)
> java.lang.Thread.run(Thread.java:745)
> {code}
> {code}
> 13:42:57,595 WARN  org.apache.flink.runtime.taskmanager.Task                  
>    - Task 'DataSource (at main(Job.java:59) 
> (org.apache.flink.api.java.io.TextInputFormat)) (16/16)' did not react to 
> cancelling signal, but is stuck in method:
>  
> org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.destroyBufferPool(NetworkBufferPool.java:212)
> org.apache.flink.runtime.io.network.buffer.LocalBufferPool.lazyDestroy(LocalBufferPool.java:221)
> org.apache.flink.runtime.io.network.partition.ResultPartition.destroyBufferPool(ResultPartition.java:302)
> org.apache.flink.runtime.io.network.NetworkEnvironment.unregisterTask(NetworkEnvironment.java:366)
> org.apache.flink.runtime.taskmanager.Task.run(Task.java:647)
> java.lang.Thread.run(Thread.java:745)
> {code}
> The issue is that during cancelling of high parallelism jobs the locks for 
> buffer pool management are highly contended.
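
The call chain in the traces above (destroyBufferPool, then 
redistributeBuffers, then setNumBuffers) can be illustrated with a small 
model. This is only a sketch, not Flink's code. It assumes that one global 
lock guards pool registration and that every unregistration redistributes 
buffers to all remaining pools. The names GlobalPool, LocalPool, factoryLock 
and simulateCancel are hypothetical and exist only for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the call chain seen in the stack traces.
// Method names echo the traces, but this is NOT Flink's implementation.
public class BufferPoolContentionSketch {

    /** Stand-in for a per-task buffer pool guarded by its own lock. */
    static final class LocalPool {
        private int numBuffers;

        synchronized void setNumBuffers(int n) {
            numBuffers = n;
        }

        synchronized int getNumBuffers() {
            return numBuffers;
        }
    }

    /** Stand-in for the shared pool: one lock guards all (un)registration. */
    static final class GlobalPool {
        private final Object factoryLock = new Object(); // assumed global lock
        private final List<LocalPool> pools = new ArrayList<>();
        private final int totalBuffers;
        private long redistributionSteps; // guarded by factoryLock

        GlobalPool(int totalBuffers) {
            this.totalBuffers = totalBuffers;
        }

        LocalPool createBufferPool() {
            synchronized (factoryLock) {
                LocalPool pool = new LocalPool();
                pools.add(pool);
                redistributeBuffers();
                return pool;
            }
        }

        void destroyBufferPool(LocalPool pool) {
            synchronized (factoryLock) { // every cancelling task queues up here
                pools.remove(pool);
                redistributeBuffers();
            }
        }

        // Called with factoryLock held: touches every remaining pool.
        private void redistributeBuffers() {
            if (pools.isEmpty()) {
                return;
            }
            int share = totalBuffers / pools.size();
            for (LocalPool pool : pools) {
                pool.setNumBuffers(share);
                redistributionSteps++;
            }
        }

        long getRedistributionSteps() {
            synchronized (factoryLock) {
                return redistributionSteps;
            }
        }
    }

    /**
     * Registers numTasks pools, destroys them all from concurrent threads,
     * and returns the number of setNumBuffers calls made during teardown.
     */
    static long simulateCancel(int numTasks) throws InterruptedException {
        GlobalPool global = new GlobalPool(numTasks * 2);
        List<LocalPool> locals = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) {
            locals.add(global.createBufferPool());
        }
        long before = global.getRedistributionSteps();

        List<Thread> threads = new ArrayList<>();
        for (LocalPool pool : locals) {
            Thread t = new Thread(() -> global.destroyBufferPool(pool));
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            t.join();
        }
        return global.getRedistributionSteps() - before;
    }

    public static void main(String[] args) throws InterruptedException {
        for (int n : new int[] {8, 100, 1000}) {
            System.out.println(n + " tasks -> " + simulateCancel(n)
                    + " redistribution steps under the global lock");
        }
    }
}
```

In this model, tearing down n pools performs n(n-1)/2 redistribution steps, 
all serialized behind one lock: 28 steps for 8 tasks, but 499500 for 1000. 
Under these assumptions the amount of work grows quadratically with the 
number of concurrently cancelled tasks, so threads would be observed 
"stuck" in these methods even if the locking itself is correct.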



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
