[ https://issues.apache.org/jira/browse/FLINK-21728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17300001#comment-17300001 ]
Xintong Song commented on FLINK-21728:
--------------------------------------
I think [~kezhuw] is correct.
I investigated this by printing the thread name and stack trace whenever an
unsafe segment is freed more than once.
{code:java}
ThreadName: SortMerger spilling thread
Stack:
java.lang.Exception
    at org.apache.flink.core.memory.HybridMemorySegment.free(HybridMemorySegment.java:146)
    at org.apache.flink.runtime.memory.MemoryManager.freeSegment(MemoryManager.java:373)
    at org.apache.flink.runtime.memory.MemoryManager.lambda$releaseSegmentsForOwnerUntilNextOwner$2(MemoryManager.java:347)
    at java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1877)
    at org.apache.flink.runtime.memory.MemoryManager.releaseSegmentsForOwnerUntilNextOwner(MemoryManager.java:344)
    at org.apache.flink.runtime.memory.MemoryManager.release(MemoryManager.java:328)
    at org.apache.flink.runtime.operators.sort.SpillingThread.disposeSortBuffers(SpillingThread.java:401)
    at org.apache.flink.runtime.operators.sort.SpillingThread.mergeInMemory(SpillingThread.java:317)
    at org.apache.flink.runtime.operators.sort.SpillingThread.go(SpillingThread.java:178)
    at org.apache.flink.runtime.operators.sort.ThreadBase.run(ThreadBase.java:73)
ThreadName: CoGroup (Out-degree) (1/4)#0
Stack:
java.lang.Exception
    at org.apache.flink.core.memory.HybridMemorySegment.free(HybridMemorySegment.java:146)
    at org.apache.flink.runtime.memory.MemoryManager.freeSegment(MemoryManager.java:373)
    at org.apache.flink.runtime.memory.MemoryManager.lambda$releaseSegmentsForOwnerUntilNextOwner$2(MemoryManager.java:347)
    at java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1877)
    at org.apache.flink.runtime.memory.MemoryManager.releaseSegmentsForOwnerUntilNextOwner(MemoryManager.java:344)
    at org.apache.flink.runtime.memory.MemoryManager.release(MemoryManager.java:328)
    at org.apache.flink.runtime.operators.sort.ExternalSorter.close(ExternalSorter.java:204)
    at org.apache.flink.runtime.operators.BatchTask.closeLocalStrategiesAndCaches(BatchTask.java:586)
    at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:360)
    at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:760)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:562)
    at java.lang.Thread.run(Thread.java:748)
{code}
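For anyone who wants to repeat this kind of check, a probe along the following
lines would produce output in the same shape as the trace above. This is only a
rough sketch, not the actual debugging patch: {{DoubleFreeProbe}} and its
methods are hypothetical names, and the probe stands in for the real segment
class.

```java
/** Hypothetical probe: remembers who freed first and reports any repeated free. */
final class DoubleFreeProbe {
    private String firstThread;   // name of the thread that freed first
    private Exception firstStack; // stack captured at the first free
    private int repeatedFrees;    // number of free() calls after the first

    /** Records the first caller; on any later call, prints both callers. */
    synchronized void free() {
        if (firstStack == null) {
            firstThread = Thread.currentThread().getName();
            firstStack = new Exception();
            return;
        }
        repeatedFrees++;
        report(firstThread, firstStack);
        report(Thread.currentThread().getName(), new Exception());
    }

    synchronized int repeatedFrees() {
        return repeatedFrees;
    }

    private static void report(String threadName, Exception stack) {
        System.err.println("ThreadName: " + threadName);
        System.err.println("Stack:");
        stack.printStackTrace();
    }
}
```

Note that the probe itself is synchronized, so it can change the timing of the
race it is meant to observe.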
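The two stacks show the spilling thread and the task thread both releasing
segments through {{MemoryManager.release}}, so both can reach
{{HybridMemorySegment.free}} concurrently. The problem can be fixed by
guarding both the triggering of the {{cleaner}} and the setting of it to
{{null}} with the {{synchronized}} keyword. I've verified that, while I can
reproduce the crash locally within a few hundred runs, I no longer see the
problem after the fix in more than 10k runs.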
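To illustrate the idea, here is a minimal sketch of such a guard. It is not the
actual patch: {{GuardedSegment}} is a hypothetical stand-in for the segment
class, and the {{Runnable}} cleaner is an assumption made to keep the example
self-contained.

```java
/** Hypothetical stand-in for a segment whose memory is released by a cleaner. */
final class GuardedSegment {
    // Hypothetical cleaner; null once it has been triggered.
    private Runnable cleaner;

    GuardedSegment(Runnable cleaner) {
        this.cleaner = cleaner;
    }

    /** Triggers the cleaner at most once, even when called from several threads. */
    void free() {
        // The null check, the cleaner call and the reset happen under one lock,
        // so a second thread can never observe a non-null cleaner after it ran.
        synchronized (this) {
            if (cleaner != null) {
                cleaner.run();
                cleaner = null;
            }
        }
    }
}
```

With this guard, whichever of the two threads arrives second finds {{cleaner}}
already {{null}} and returns without touching the memory again.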
> DegreesWithExceptionITCase crash
> --------------------------------
>
> Key: FLINK-21728
> URL: https://issues.apache.org/jira/browse/FLINK-21728
> Project: Flink
> Issue Type: Bug
> Components: Library / Graph Processing (Gelly)
> Affects Versions: 1.13.0
> Reporter: Guowei Ma
> Priority: Major
> Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=14422&view=logs&j=ce8f3cc3-c1ea-5281-f5eb-df9ebd24947f&t=f266c805-9429-58ed-2f9e-482e7b82f58b