[ 
https://issues.apache.org/jira/browse/FLINK-15900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17030071#comment-17030071
 ] 

Stephan Ewen commented on FLINK-15900:
--------------------------------------

I looked into the implementation and I could see one possible race condition 
around releasing the memory:

{code}
final boolean allDisposed = sharedResources.release(type, leaseHolder);
if (allDisposed) {
        releaseMemory(type, MemoryType.OFF_HEAP, size);
}
{code}

If another allocation occurs while the releasing thread is between the 
"sharedResources.release" call and the "if" block, the shared resource is 
already gone, so there is nothing for the new allocation to pick up, but its 
memory has not yet been returned to the memory bookkeeping. The new allocation 
then fails for lack of memory.
Looks like a legitimate bug.
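
To make the window concrete, here is a minimal sketch of one way to close it, 
assuming a single lock can guard both the lease counting and the memory 
bookkeeping. The class "SharedMemorySketch" and all of its members are 
hypothetical stand-ins for illustration, not Flink's actual MemoryManager, and 
this is not necessarily the fix that should go into Flink:

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the memory bookkeeping; not Flink's actual class. */
final class SharedMemorySketch {
    private final Object lock = new Object();
    private long availableBytes;
    private final Map<String, Integer> leases = new HashMap<>();
    private final Map<String, Long> reserved = new HashMap<>();

    SharedMemorySketch(long budgetBytes) {
        this.availableBytes = budgetBytes;
    }

    /** Takes a lease; reserves memory only if the shared resource does not exist yet. */
    void acquire(String type, long size) {
        synchronized (lock) {
            Integer count = leases.get(type);
            if (count != null) {
                leases.put(type, count + 1); // resource exists, just share it
                return;
            }
            if (size > availableBytes) {
                throw new IllegalStateException(
                        "Could not allocate " + size + " bytes. Only "
                                + availableBytes + " bytes are remaining.");
            }
            availableBytes -= size;
            reserved.put(type, size);
            leases.put(type, 1);
        }
    }

    /** Drops a lease; the last release returns the memory under the same lock. */
    void release(String type) {
        synchronized (lock) {
            Integer count = leases.get(type);
            if (count == null) {
                return; // nothing to release
            }
            if (count > 1) {
                leases.put(type, count - 1);
                return;
            }
            // Last lease: remove the resource AND return its memory atomically,
            // so no acquire() can run between the two steps.
            leases.remove(type);
            availableBytes += reserved.remove(type);
        }
    }

    long availableBytes() {
        synchronized (lock) {
            return availableBytes;
        }
    }
}
```

With this shape, an acquire that follows the last release always sees either 
the still-existing resource or the fully returned memory, never the state in 
between.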

I would not classify this as a release blocker, for two reasons:
  - It should be rather rare. We probably only see it because of the many 
tests and Travis' famously heavy race conditions.
  - It should be recoverable through Flink's recovery (failover, re-deploy). 
We only see the test failure because the tests allow exactly one recovery 
(for the one failure the test is designed to recover from). The additional 
recovery needed to get out of this situation is suppressed by the 
RestartStrategy.

> JoinITCase#testRightJoinWithPk failed on Travis
> -----------------------------------------------
>
>                 Key: FLINK-15900
>                 URL: https://issues.apache.org/jira/browse/FLINK-15900
>             Project: Flink
>          Issue Type: Bug
>          Components: Table SQL / Planner
>    Affects Versions: 1.10.0
>            Reporter: Gary Yao
>            Priority: Critical
>              Labels: test-stability
>             Fix For: 1.10.0
>
>
> {noformat}
> org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
>       at 
> org.apache.flink.table.planner.runtime.stream.sql.JoinITCase.testRightJoinWithPk(JoinITCase.scala:672)
> Caused by: org.apache.flink.runtime.JobException: Recovery is suppressed by 
> FixedDelayRestartBackoffTimeStrategy(maxNumberRestartAttempts=1, 
> backoffTimeMS=0)
> Caused by: java.lang.Exception: Exception while creating 
> StreamOperatorStateContext.
> Caused by: org.apache.flink.util.FlinkException: Could not restore keyed 
> state backend for KeyedProcessOperator_17aecc34cf8aa256be6fe4836cbdf29a_(2/4) 
> from any of the 1 provided restore options.
> Caused by: java.io.IOException: Failed to acquire shared cache resource for 
> RocksDB
> Caused by: org.apache.flink.runtime.memory.MemoryAllocationException: Could 
> not created the shared memory resource of size 20971520. Not enough memory 
> left to reserve from the slot's managed memory.
> Caused by: org.apache.flink.runtime.memory.MemoryReservationException: Could 
> not allocate 20971520 bytes. Only 0 bytes are remaining.
> {noformat}
> https://api.travis-ci.org/v3/job/645466432/log.txt



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
