[ 
https://issues.apache.org/jira/browse/FLINK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14638613#comment-14638613
 ] 

Ufuk Celebi commented on FLINK-2341:
------------------------------------

I think this is an issue with the test.

In general, I think the asynchronous reader variants were premature in the 
first place (I added them). They add quite a bit of complexity, and their 
performance benefit has not been measured. We would probably not recommend 
that anyone use this variant at the moment.

I am considering whether it would be better to just remove the async reader 
variants and pick them up again only if that becomes necessary.

> Deadlock in SpilledSubpartitionViewAsyncIO
> ------------------------------------------
>
>                 Key: FLINK-2341
>                 URL: https://issues.apache.org/jira/browse/FLINK-2341
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Runtime
>    Affects Versions: 0.9, 0.10
>            Reporter: Stephan Ewen
>            Assignee: Ufuk Celebi
>            Priority: Critical
>             Fix For: 0.9, 0.10
>
>
> The deadlock may be caused by the way {{SpilledSubpartitionViewTest}} is 
> written:
> {code}
> Found one Java-level deadlock:
> =============================
> "pool-25-thread-2":
>   waiting to lock monitor 0x00007f66f4932468 (object 0x00000000fa1478f0, a java.lang.Object),
>   which is held by "IOManager reader thread #1"
> "IOManager reader thread #1":
>   waiting to lock monitor 0x00007f66f4931160 (object 0x00000000fa029768, a java.lang.Object),
>   which is held by "pool-25-thread-2"
> Java stack information for the threads listed above:
> ===================================================
> "pool-25-thread-2":
>       at org.apache.flink.runtime.io.network.partition.SpilledSubpartitionViewAsyncIO.notifyError(SpilledSubpartitionViewAsyncIO.java:304)
>       - waiting to lock <0x00000000fa1478f0> (a java.lang.Object)
>       at org.apache.flink.runtime.io.network.partition.SpilledSubpartitionViewAsyncIO.onAvailableBuffer(SpilledSubpartitionViewAsyncIO.java:256)
>       at org.apache.flink.runtime.io.network.partition.SpilledSubpartitionViewAsyncIO.access$300(SpilledSubpartitionViewAsyncIO.java:42)
>       at org.apache.flink.runtime.io.network.partition.SpilledSubpartitionViewAsyncIO$BufferProviderCallback.onEvent(SpilledSubpartitionViewAsyncIO.java:367)
>       at org.apache.flink.runtime.io.network.partition.SpilledSubpartitionViewAsyncIO$BufferProviderCallback.onEvent(SpilledSubpartitionViewAsyncIO.java:353)
>       at org.apache.flink.runtime.io.network.util.TestPooledBufferProvider$PooledBufferProviderRecycler.recycle(TestPooledBufferProvider.java:135)
>       - locked <0x00000000fa029768> (a java.lang.Object)
>       at org.apache.flink.runtime.io.network.buffer.Buffer.recycle(Buffer.java:119)
>       - locked <0x00000000fa3a1a20> (a java.lang.Object)
>       at org.apache.flink.runtime.io.network.util.TestSubpartitionConsumer.call(TestSubpartitionConsumer.java:95)
>       at org.apache.flink.runtime.io.network.util.TestSubpartitionConsumer.call(TestSubpartitionConsumer.java:39)
>       at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>       at java.lang.Thread.run(Thread.java:701)
> "IOManager reader thread #1":
>       at org.apache.flink.runtime.io.network.util.TestPooledBufferProvider$PooledBufferProviderRecycler.recycle(TestPooledBufferProvider.java:127)
>       - waiting to lock <0x00000000fa029768> (a java.lang.Object)
>       at org.apache.flink.runtime.io.network.buffer.Buffer.recycle(Buffer.java:119)
>       - locked <0x00000000fa3a1ea0> (a java.lang.Object)
>       at org.apache.flink.runtime.io.network.partition.SpilledSubpartitionViewAsyncIO.returnBufferFromIOThread(SpilledSubpartitionViewAsyncIO.java:270)
>       - locked <0x00000000fa1478f0> (a java.lang.Object)
>       at org.apache.flink.runtime.io.network.partition.SpilledSubpartitionViewAsyncIO.access$100(SpilledSubpartitionViewAsyncIO.java:42)
>       at org.apache.flink.runtime.io.network.partition.SpilledSubpartitionViewAsyncIO$IOThreadCallback.requestSuccessful(SpilledSubpartitionViewAsyncIO.java:338)
>       at org.apache.flink.runtime.io.network.partition.SpilledSubpartitionViewAsyncIO$IOThreadCallback.requestSuccessful(SpilledSubpartitionViewAsyncIO.java:328)
>       at org.apache.flink.runtime.io.disk.iomanager.AsynchronousFileIOChannel.handleProcessedBuffer(AsynchronousFileIOChannel.java:199)
>       at org.apache.flink.runtime.io.disk.iomanager.BufferReadRequest.requestDone(AsynchronousFileIOChannel.java:431)
>       at org.apache.flink.runtime.io.disk.iomanager.IOManagerAsync$ReaderThread.run(IOManagerAsync.java:377)
> {code}
> The full log with the deadlock stack traces can be found here:
> https://s3.amazonaws.com/archive.travis-ci.org/jobs/70232347/log.txt
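
The trace above shows a lock-order inversion: the consumer thread holds the 
test buffer provider's lock while its callback tries to take the view's lock, 
and the IO reader thread holds the view's lock while recycle() tries to take 
the provider's lock. As a rough illustration only, the sketch below shows one 
possible way to break such a cycle, by invoking the listener only after the 
provider lock has been released. All names here (LockOrderSketch, Provider, 
View and their methods) are hypothetical stand-ins and not Flink code, and 
this is not necessarily how the test or the view should be changed.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-ins for the buffer provider and the view; not Flink classes. */
public class LockOrderSketch {

    /** Plays the role of the test buffer provider: one lock guards its state. */
    static final class Provider {
        private final Object providerLock = new Object();
        private Runnable listener; // pending "buffer available" callback
        private long recycled;

        void registerListener(Runnable r) {
            synchronized (providerLock) {
                listener = r;
            }
        }

        void recycle(int buffer) {
            Runnable toNotify;
            synchronized (providerLock) {
                recycled++;
                toNotify = listener;
                listener = null;
            }
            // Call back only after providerLock is released, so the callback
            // never takes viewLock while providerLock is still held.
            if (toNotify != null) {
                toNotify.run();
            }
        }
    }

    /** Plays the role of the view: its callback and its IO path share viewLock. */
    static final class View {
        private final Object viewLock = new Object();
        private final Provider provider;
        private long notifications;

        View(Provider provider) {
            this.provider = provider;
        }

        // Consumer-thread path: provider callback, then viewLock.
        void onBufferAvailable() {
            synchronized (viewLock) {
                notifications++;
            }
        }

        // IO-thread path: viewLock, then provider.recycle(), then providerLock.
        void returnBufferFromIoThread(int buffer) {
            synchronized (viewLock) {
                provider.recycle(buffer);
            }
        }
    }

    /** Runs both paths concurrently; returns true if both finish within the timeout. */
    static boolean run(int iterations) {
        Provider provider = new Provider();
        View view = new View(provider);
        CountDownLatch done = new CountDownLatch(2);

        Thread consumer = new Thread(() -> {
            for (int i = 0; i < iterations; i++) {
                provider.registerListener(view::onBufferAvailable);
                provider.recycle(i);
            }
            done.countDown();
        }, "consumer");

        Thread io = new Thread(() -> {
            for (int i = 0; i < iterations; i++) {
                view.returnBufferFromIoThread(i);
            }
            done.countDown();
        }, "io-reader");

        // Daemon threads so a hang cannot keep the JVM alive.
        consumer.setDaemon(true);
        io.setDaemon(true);
        consumer.start();
        io.start();
        try {
            return done.await(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(run(100_000) ? "completed" : "timed out");
    }
}
```

With the callback moved outside the provider lock, the only nested 
acquisition left is viewLock followed by providerLock, so the two threads can 
no longer wait on each other in a cycle. If the callback were instead invoked 
inside the synchronized block in recycle(), the sketch would have the same 
lock ordering as the two threads in the trace.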


