[
https://issues.apache.org/jira/browse/FLINK-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14709214#comment-14709214
]
ASF GitHub Bot commented on FLINK-2089:
---------------------------------------
GitHub user uce opened a pull request:
https://github.com/apache/flink/pull/1050
[FLINK-2089] [runtime] Fix illegal state in RecordWriter after partition
write failure
I'm waiting for feedback from a user whether this fixes FLINK-2089, but
this PR definitely addresses a problem.
Record writers have a `clearBuffers` method, which is called by the task
code in a finally block at the end of `invoke` (see `RegularPactTask` for
example). This call clears the buffers of the record serializers.
The following illegal state can arise: a buffer has been published to a
partition, but the serializers still hold a reference to it. When a serializer
tries to clear its current buffer, it might have already been recycled (because
it was published to the partition). This will currently happen if there was an
Exception during writing the buffer to the partition.
This PR replaces the write-and-clear calls with a try-catch-finally block
and tests the expected behaviour in a new test. The removed tests are
superseded by this new test.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/uce/flink illegal-2089-master
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/flink/pull/1050.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1050
----
commit 88ca58b5e3e78354ca1cffee4e11b48011333c6b
Author: Ufuk Celebi <[email protected]>
Date: 2015-08-19T14:11:13Z
[FLINK-2089] [runtime] Fix illegal state in RecordWriter after partition
write failure
----
> "Buffer recycled" IllegalStateException during cancelling
> ---------------------------------------------------------
>
> Key: FLINK-2089
> URL: https://issues.apache.org/jira/browse/FLINK-2089
> Project: Flink
> Issue Type: Bug
> Components: Distributed Runtime
> Affects Versions: master
> Reporter: Ufuk Celebi
> Assignee: Ufuk Celebi
> Fix For: 0.9
>
>
> [~rmetzger] reported the following stack trace during cancelling of high
> parallelism jobs:
> {code}
> Error: java.lang.IllegalStateException: Buffer has already been recycled.
> at
> org.apache.flink.shaded.com.google.common.base.Preconditions.checkState(Preconditions.java:173)
> at
> org.apache.flink.runtime.io.network.buffer.Buffer.ensureNotRecycled(Buffer.java:142)
> at
> org.apache.flink.runtime.io.network.buffer.Buffer.getMemorySegment(Buffer.java:78)
> at
> org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer.setNextBuffer(SpillingAdaptiveSpanningRecordDeserializer.java:72)
> at
> org.apache.flink.runtime.io.network.api.reader.AbstractRecordReader.getNextRecord(AbstractRecordReader.java:80)
> at
> org.apache.flink.runtime.io.network.api.reader.MutableRecordReader.next(MutableRecordReader.java:34)
> at
> org.apache.flink.runtime.operators.util.ReaderIterator.next(ReaderIterator.java:73)
> at org.apache.flink.runtime.operators.MapDriver.run(MapDriver.java:96)
> at
> org.apache.flink.runtime.operators.RegularPactTask.run(RegularPactTask.java:496)
> at
> org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:362)
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> This looks like a concurrent buffer pool release/buffer usage error. I'm
> investing this today.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)