[ https://issues.apache.org/jira/browse/FLINK-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14561276#comment-14561276 ]
ASF GitHub Bot commented on FLINK-2089: --------------------------------------- GitHub user uce opened a pull request: https://github.com/apache/flink/pull/736 [FLINK-2089] [runtime] Fix possible duplicate buffer release This PR contains multiple independent commits, which address issues discovered while debugging FLINK-2089. - It adds the partition request backoff logic to local requests as well. The backoffs were introduced recently for remote requests. I've missed that the same problem could also happen for local input channels. The fix was easy and moves the backoff logic to the abstract InputChannel, which both Local and RemoteInputChannel extend. - The duplicate buffer release was hard to track. In some corner cases, the record serializers were incorrectly holding references to buffers *after* written them out to a result partition. In failure cases, the serializers recycled these buffers too early. The later recycling (by the component, which is actually responsible for this) then resulted in an IllegalStateException. You can merge this pull request into a Git repository by running: $ git pull https://github.com/uce/incubator-flink cancel-2089 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/736.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #736 ---- commit e134f44640f2caafc6cff76fd100b11e3aa47515 Author: Ufuk Celebi <u...@apache.org> Date: 2015-05-26T09:15:17Z [runtime] [tests] Add TaskCancelTest commit d6a33bfda84ea861e05e7a0aff6c529808c02bb2 Author: Ufuk Celebi <u...@apache.org> Date: 2015-05-26T13:37:35Z [FLINK-1636] [runtime] Add partition request backoff logic to LocalInputChannel commit 7ea3ed2ad4c95c1bec0f2d558ba0d4faf9716f14 Author: Ufuk Celebi <u...@apache.org> Date: 2015-05-27T12:49:01Z [FLINK-2089] Fix possible duplicate buffer release Problem: RecordWriter instances have stateful serializers, which include the buffer that they currently work with. If the serializer state is not cleared correctly by the writers after writing a buffer to the respective result partition, it is possible that buffers are recycled multiple times in failure cases. This results in an IllegalStateException. Solution: After writing a buffer to a ResultPartition, the RecordWriter makes sure that the serializer clears the reference to the respective buffer. The recycling of the buffer is then the responsibility of the result partition. commit 91b6049ac371a62671c36a8280b0a60f1b2b7408 Author: Ufuk Celebi <u...@apache.org> Date: 2015-05-27T16:26:46Z [runtime] [logging] Fix log message ---- > "Buffer recycled" IllegalStateException during cancelling > --------------------------------------------------------- > > Key: FLINK-2089 > URL: https://issues.apache.org/jira/browse/FLINK-2089 > Project: Flink > Issue Type: Bug > Components: Distributed Runtime > Affects Versions: master > Reporter: Ufuk Celebi > Assignee: Ufuk Celebi > > [~rmetzger] reported the following stack trace during cancelling of high > parallelism jobs: > {code} > Error: java.lang.IllegalStateException: Buffer has already been recycled. > at > org.apache.flink.shaded.com.google.common.base.Preconditions.checkState(Preconditions.java:173) > at > org.apache.flink.runtime.io.network.buffer.Buffer.ensureNotRecycled(Buffer.java:142) > at > org.apache.flink.runtime.io.network.buffer.Buffer.getMemorySegment(Buffer.java:78) > at > org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer.setNextBuffer(SpillingAdaptiveSpanningRecordDeserializer.java:72) > at > org.apache.flink.runtime.io.network.api.reader.AbstractRecordReader.getNextRecord(AbstractRecordReader.java:80) > at > org.apache.flink.runtime.io.network.api.reader.MutableRecordReader.next(MutableRecordReader.java:34) > at > org.apache.flink.runtime.operators.util.ReaderIterator.next(ReaderIterator.java:73) > at org.apache.flink.runtime.operators.MapDriver.run(MapDriver.java:96) > at > org.apache.flink.runtime.operators.RegularPactTask.run(RegularPactTask.java:496) > at > org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:362) > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559) > at java.lang.Thread.run(Thread.java:745) > {code} > This looks like a concurrent buffer pool release/buffer usage error. I'm > investing this today. -- This message was sent by Atlassian JIRA (v6.3.4#6332)