[ 
https://issues.apache.org/jira/browse/FLINK-21728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17300001#comment-17300001
 ] 

Xintong Song commented on FLINK-21728:
--------------------------------------

I think [~kezhuw] is correct.

I investigated this by printing the thread name and stack trace whenever an 
unsafe segment is freed more than once. Two different threads release the same 
segment:
{code:java}
ThreadName: SortMerger spilling thread
Stack:
java.lang.Exception
        at org.apache.flink.core.memory.HybridMemorySegment.free(HybridMemorySegment.java:146)
        at org.apache.flink.runtime.memory.MemoryManager.freeSegment(MemoryManager.java:373)
        at org.apache.flink.runtime.memory.MemoryManager.lambda$releaseSegmentsForOwnerUntilNextOwner$2(MemoryManager.java:347)
        at java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1877)
        at org.apache.flink.runtime.memory.MemoryManager.releaseSegmentsForOwnerUntilNextOwner(MemoryManager.java:344)
        at org.apache.flink.runtime.memory.MemoryManager.release(MemoryManager.java:328)
        at org.apache.flink.runtime.operators.sort.SpillingThread.disposeSortBuffers(SpillingThread.java:401)
        at org.apache.flink.runtime.operators.sort.SpillingThread.mergeInMemory(SpillingThread.java:317)
        at org.apache.flink.runtime.operators.sort.SpillingThread.go(SpillingThread.java:178)
        at org.apache.flink.runtime.operators.sort.ThreadBase.run(ThreadBase.java:73)

ThreadName: CoGroup (Out-degree) (1/4)#0
Stack:
java.lang.Exception
        at org.apache.flink.core.memory.HybridMemorySegment.free(HybridMemorySegment.java:146)
        at org.apache.flink.runtime.memory.MemoryManager.freeSegment(MemoryManager.java:373)
        at org.apache.flink.runtime.memory.MemoryManager.lambda$releaseSegmentsForOwnerUntilNextOwner$2(MemoryManager.java:347)
        at java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1877)
        at org.apache.flink.runtime.memory.MemoryManager.releaseSegmentsForOwnerUntilNextOwner(MemoryManager.java:344)
        at org.apache.flink.runtime.memory.MemoryManager.release(MemoryManager.java:328)
        at org.apache.flink.runtime.operators.sort.ExternalSorter.close(ExternalSorter.java:204)
        at org.apache.flink.runtime.operators.BatchTask.closeLocalStrategiesAndCaches(BatchTask.java:586)
        at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:360)
        at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:760)
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:562)
        at java.lang.Thread.run(Thread.java:748)
{code}

The problem can be fixed by guarding both the triggering of the {{cleaner}} and 
the setting of it to {{null}} with the {{synchronized}} keyword. I've verified 
this: without the fix I can reproduce the crash locally within a few hundred 
runs, and with the fix I don't see the problem in more than 10k runs.
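
To illustrate the idea, here is a minimal sketch of the guarded check-run-clear 
pattern. It is not the actual patch: the class {{GuardedSegment}}, its fields, 
and the helper {{raceTwoThreads}} are made up for illustration, and the real 
{{HybridMemorySegment}} may be structured differently.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a memory segment; NOT Flink's actual class. */
final class GuardedSegment {

    /** Releases the underlying memory; null once the segment has been freed. */
    private Runnable cleaner;

    GuardedSegment(Runnable cleaner) {
        this.cleaner = cleaner;
    }

    /** May be called from several threads; the cleaner runs at most once. */
    void free() {
        // The null check, the cleaner call, and the reset to null happen
        // under one lock, so a second caller always observes null.
        synchronized (this) {
            if (cleaner != null) {
                cleaner.run();
                cleaner = null;
            }
        }
    }

    /** Races two threads on free() and returns how often the cleaner ran. */
    static int raceTwoThreads() {
        AtomicInteger cleanerRuns = new AtomicInteger();
        GuardedSegment segment = new GuardedSegment(cleanerRuns::incrementAndGet);

        // Thread names are illustrative, loosely mirroring the two stacks above.
        Thread spiller = new Thread(segment::free, "spilling-thread");
        Thread task = new Thread(segment::free, "task-thread");
        spiller.start();
        task.start();
        try {
            spiller.join();
            task.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return cleanerRuns.get();
    }
}
```

Without the {{synchronized}} block, both threads could pass the null check 
before either one clears the field, and the cleaner would then run twice on the 
same memory.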

> DegreesWithExceptionITCase crash
> --------------------------------
>
>                 Key: FLINK-21728
>                 URL: https://issues.apache.org/jira/browse/FLINK-21728
>             Project: Flink
>          Issue Type: Bug
>          Components: Library / Graph Processing (Gelly)
>    Affects Versions: 1.13.0
>            Reporter: Guowei Ma
>            Priority: Major
>              Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=14422&view=logs&j=ce8f3cc3-c1ea-5281-f5eb-df9ebd24947f&t=f266c805-9429-58ed-2f9e-482e7b82f58b



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
