[jira] [Commented] (FLINK-2089) "Buffer recycled" IllegalStateException during cancelling

ASF GitHub Bot (JIRA) Wed, 27 May 2015 10:06:40 -0700

    [ 
https://issues.apache.org/jira/browse/FLINK-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14561276#comment-14561276
 ]


ASF GitHub Bot commented on FLINK-2089:
---------------------------------------

GitHub user uce opened a pull request:

    https://github.com/apache/flink/pull/736

    [FLINK-2089] [runtime] Fix possible duplicate buffer release

    This PR contains multiple independent commits, which address issues 
discovered while debugging FLINK-2089.
    
    - It adds the partition request backoff logic to local requests as well. 
The backoffs were introduced recently for remote requests. I missed that the 
same problem could also happen for local input channels. The fix was easy: it 
moves the backoff logic to the abstract InputChannel, which both 
LocalInputChannel and RemoteInputChannel extend.
    
    - The duplicate buffer release was hard to track down. In some corner 
cases, the record serializers incorrectly held references to buffers *after* 
writing them out to a result partition. In failure cases, the serializers 
recycled these buffers too early. The later recycling (by the component that 
is actually responsible for it) then resulted in an IllegalStateException.
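
To illustrate the first item above, here is a minimal sketch of backoff state 
kept in a shared abstract base class so that local and remote channels behave 
the same way. All names here (BackoffSketch, Channel, LocalChannel, 
RemoteChannel, increaseBackoff) and the doubling policy are assumptions made 
for illustration. They are not taken from the Flink code in this pull request.

```java
// Hypothetical sketch: class and method names are illustrative, not Flink's API.
final class BackoffSketch {

    /** Shared base class: owns the backoff state for every channel type. */
    abstract static class Channel {
        private final int initialBackoffMs;
        private final int maxBackoffMs;
        private int currentBackoffMs; // 0 means no retry has been scheduled yet

        Channel(int initialBackoffMs, int maxBackoffMs) {
            if (initialBackoffMs <= 0 || maxBackoffMs < initialBackoffMs) {
                throw new IllegalArgumentException("invalid backoff bounds");
            }
            this.initialBackoffMs = initialBackoffMs;
            this.maxBackoffMs = maxBackoffMs;
        }

        int currentBackoffMs() {
            return currentBackoffMs;
        }

        /**
         * Assumed policy: start at the initial value, then double up to the
         * maximum. Returns false once the maximum was already reached, which
         * a caller could treat as "give up on the partition request".
         */
        boolean increaseBackoff() {
            if (currentBackoffMs == 0) {
                currentBackoffMs = initialBackoffMs;
                return true;
            }
            if (currentBackoffMs < maxBackoffMs) {
                // Compute in long to avoid int overflow before clamping.
                currentBackoffMs = (int) Math.min(2L * currentBackoffMs, maxBackoffMs);
                return true;
            }
            return false;
        }
    }

    /** Both subclasses inherit the same retry behaviour from Channel. */
    static final class LocalChannel extends Channel {
        LocalChannel(int initialBackoffMs, int maxBackoffMs) {
            super(initialBackoffMs, maxBackoffMs);
        }
    }

    static final class RemoteChannel extends Channel {
        RemoteChannel(int initialBackoffMs, int maxBackoffMs) {
            super(initialBackoffMs, maxBackoffMs);
        }
    }
}
```

With this layout, a caller would retry the partition request after 
currentBackoffMs() milliseconds for as long as increaseBackoff() returns 
true, regardless of whether the channel is local or remote.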

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/uce/incubator-flink cancel-2089

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/736.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #736
    
----
commit e134f44640f2caafc6cff76fd100b11e3aa47515
Author: Ufuk Celebi <u...@apache.org>
Date:   2015-05-26T09:15:17Z

    [runtime] [tests] Add TaskCancelTest

commit d6a33bfda84ea861e05e7a0aff6c529808c02bb2
Author: Ufuk Celebi <u...@apache.org>
Date:   2015-05-26T13:37:35Z

    [FLINK-1636] [runtime] Add partition request backoff logic to LocalInputChannel

commit 7ea3ed2ad4c95c1bec0f2d558ba0d4faf9716f14
Author: Ufuk Celebi <u...@apache.org>
Date:   2015-05-27T12:49:01Z

    [FLINK-2089] Fix possible duplicate buffer release
    
    Problem: RecordWriter instances have stateful serializers, which include the
    buffer that they currently work with. If the serializer state is not cleared
    correctly by the writers after writing a buffer to the respective result
    partition, it is possible that buffers are recycled multiple times in
    failure cases. This results in an IllegalStateException.
    
    Solution: After writing a buffer to a ResultPartition, the RecordWriter
    makes sure that the serializer clears the reference to the respective
    buffer. The recycling of the buffer is then the responsibility of the
    result partition.

commit 91b6049ac371a62671c36a8280b0a60f1b2b7408
Author: Ufuk Celebi <u...@apache.org>
Date:   2015-05-27T16:26:46Z

    [runtime] [logging] Fix log message

----
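
The problem and solution described in the FLINK-2089 commit message above can 
be sketched in a few lines. This is a simplified model under stated 
assumptions, not the actual Flink classes: RecycleSketch, Buf, Serializer, 
Partition, writeOut and releaseOnFailure are hypothetical names, and the 
clearReference flag exists only to show the behaviour before and after the 
fix side by side.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the ownership problem; not Flink's real classes.
final class RecycleSketch {

    /** A buffer that refuses to be recycled twice. */
    static final class Buf {
        private boolean recycled;

        void recycle() {
            if (recycled) {
                throw new IllegalStateException("Buffer has already been recycled.");
            }
            recycled = true;
        }

        boolean isRecycled() {
            return recycled;
        }
    }

    /** Stateful serializer that remembers the buffer it currently works with. */
    static final class Serializer {
        private Buf current;

        void setBuffer(Buf buffer) {
            current = buffer;
        }

        Buf getCurrent() {
            return current;
        }

        void clearCurrent() {
            current = null;
        }

        /** Failure-path cleanup: recycles whatever buffer is still referenced. */
        void releaseOnFailure() {
            if (current != null) {
                current.recycle();
                current = null;
            }
        }
    }

    /** The component that owns buffers once they have been written out. */
    static final class Partition {
        private final List<Buf> queued = new ArrayList<>();

        void add(Buf buffer) {
            queued.add(buffer);
        }

        void release() {
            for (Buf buffer : queued) {
                buffer.recycle();
            }
            queued.clear();
        }
    }

    /**
     * Hands the serializer's buffer to the partition. With clearReference set
     * to false the serializer keeps its stale reference, which models the bug.
     */
    static void writeOut(Serializer serializer, Partition partition, boolean clearReference) {
        partition.add(serializer.getCurrent());
        if (clearReference) {
            serializer.clearCurrent();
        }
    }
}
```

In this model, calling releaseOnFailure() and then release() after 
writeOut(..., false) recycles the same buffer twice and raises the 
IllegalStateException, while writeOut(..., true) leaves the partition as the 
only owner.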


> "Buffer recycled" IllegalStateException during cancelling
> ---------------------------------------------------------
>
>                 Key: FLINK-2089
>                 URL: https://issues.apache.org/jira/browse/FLINK-2089
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Runtime
>    Affects Versions: master
>            Reporter: Ufuk Celebi
>            Assignee: Ufuk Celebi
>
> [~rmetzger] reported the following stack trace during cancelling of high 
> parallelism jobs:
> {code}
> Error: java.lang.IllegalStateException: Buffer has already been recycled.
> at org.apache.flink.shaded.com.google.common.base.Preconditions.checkState(Preconditions.java:173)
> at org.apache.flink.runtime.io.network.buffer.Buffer.ensureNotRecycled(Buffer.java:142)
> at org.apache.flink.runtime.io.network.buffer.Buffer.getMemorySegment(Buffer.java:78)
> at org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer.setNextBuffer(SpillingAdaptiveSpanningRecordDeserializer.java:72)
> at org.apache.flink.runtime.io.network.api.reader.AbstractRecordReader.getNextRecord(AbstractRecordReader.java:80)
> at org.apache.flink.runtime.io.network.api.reader.MutableRecordReader.next(MutableRecordReader.java:34)
> at org.apache.flink.runtime.operators.util.ReaderIterator.next(ReaderIterator.java:73)
> at org.apache.flink.runtime.operators.MapDriver.run(MapDriver.java:96)
> at org.apache.flink.runtime.operators.RegularPactTask.run(RegularPactTask.java:496)
> at org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:362)
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> This looks like a concurrent buffer pool release/buffer usage error. I'm 
> investigating this today.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
