Zhilong Hong created FLINK-23833:
------------------------------------
Summary: Cache of ShuffleDescriptors should be individually
cleaned up
Key: FLINK-23833
URL: https://issues.apache.org/jira/browse/FLINK-23833
Project: Flink
Issue Type: Bug
Components: Runtime / Coordination
Affects Versions: 1.14.0
Reporter: Zhilong Hong
Fix For: 1.14.0
FLINK-23005 introduced a cache of the compressed, serialized ShuffleDescriptors
to speed up deployment. To keep the cache from living too long and becoming a
burden for GC, it is cleaned up when the partition is released or reset for a
new execution. The implementation cleans up the cache of the entire
IntermediateResult, because at that time a partition was released only when the
entire IntermediateResult was released.
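As a rough illustration of the caching idea described above, the following
sketch serializes and compresses a list of descriptors once and reuses the
bytes until the cache is cleared. It is a toy model, not Flink code: the class
name, the use of plain strings as descriptors, and the use of Java
serialization with Deflater compression are all assumptions made for the
example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.DeflaterOutputStream;

/** Hypothetical stand-in for a cache of compressed, serialized descriptors. */
final class CompressedDescriptorCacheSketch {
    // Plain strings stand in for ShuffleDescriptors (assumption for this sketch).
    private final ArrayList<String> descriptors;
    private byte[] cached;
    // Counts how often the expensive serialize-and-compress step actually ran.
    int serializationCount = 0;

    CompressedDescriptorCacheSketch(List<String> descriptors) {
        this.descriptors = new ArrayList<>(descriptors);
    }

    // Returns the cached bytes, computing them only on the first call after a clear.
    byte[] getOrCompute() {
        if (cached == null) {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out =
                    new ObjectOutputStream(new DeflaterOutputStream(bytes))) {
                out.writeObject(descriptors);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            cached = bytes.toByteArray();
            serializationCount++;
        }
        return cached;
    }

    // Called when the partition is released or reset, so the bytes can be collected.
    void clear() {
        cached = null;
    }
}
```

Repeated deployments would call getOrCompute() and share the same byte array,
while clear() models the cleanup on release or reset.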
However, since FLINK-22017, a BLOCKING result partition can be consumed
individually. This also means that a result partition no longer needs to wait
for the other result partitions and can be released individually. After this
change, the following can happen: when one result partition is finished, the
cache of the IntermediateResult on the blob is deleted while other result
partitions of the same IntermediateResult have only just been deployed to the
TaskExecutors. When those TaskExecutors then try to download the
TaskDeploymentDescriptor (TDD) from the blob, they find that the blob has been
deleted and get stuck.
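The scenario can be reproduced in a toy model. The sketch below is not Flink
code; the class, its methods, and the string result ids are hypothetical and
only mimic a cache that is keyed by the whole IntermediateResult.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Toy model of the problem; all names are hypothetical, not Flink classes. */
final class ResultLevelCacheSketch {
    // One cached blob per IntermediateResult, keyed by a made-up result id.
    private final Map<String, byte[]> blobByResult = new HashMap<>();

    void cache(String resultId, byte[] serializedDescriptors) {
        blobByResult.put(resultId, serializedDescriptors);
    }

    // Problematic behavior: releasing any single partition drops the blob
    // of the whole result, regardless of which partition was released.
    void onPartitionReleased(String resultId, int partitionIndex) {
        blobByResult.remove(resultId);
    }

    // Stand-in for a TaskExecutor downloading the offloaded descriptors.
    Optional<byte[]> download(String resultId) {
        return Optional.ofNullable(blobByResult.get(resultId));
    }

    public static void main(String[] args) {
        ResultLevelCacheSketch blobs = new ResultLevelCacheSketch();
        blobs.cache("result-A", new byte[] {1, 2, 3});

        // Partition 0 is released while the consumer of partition 1 is still deploying.
        blobs.onPartitionReleased("result-A", 0);

        // The consumer of partition 1 now finds nothing to download.
        System.out.println(
                "blob still available: " + blobs.download("result-A").isPresent());
        // prints "blob still available: false"
    }
}
```

In the real system the missing blob shows up as a stuck deployment rather than
an empty Optional; the sketch only shows why the lookup fails.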
This bug only occurs in jobs with POINTWISE BLOCKING edges, and only when
{{blob.offload.minsize}} is set to an extremely small value, because the
ShuffleDescriptors of POINTWISE BLOCKING edges are usually small. To fix
it, we just need to clean up the cache of ShuffleDescriptors individually.
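A minimal sketch of what individual cleanup could look like, assuming a cache
keyed by partition rather than by the whole IntermediateResult. The class,
method names, and key types are hypothetical, and the actual fix in Flink may
be structured differently.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Toy model of per-partition cleanup; names are hypothetical, not Flink classes. */
final class PartitionLevelCacheSketch {
    // result id -> (partition index -> cached serialized descriptors)
    private final Map<String, Map<Integer, byte[]>> blobs = new HashMap<>();

    void cache(String resultId, int partitionIndex, byte[] bytes) {
        blobs.computeIfAbsent(resultId, id -> new HashMap<>()).put(partitionIndex, bytes);
    }

    // Releasing one partition removes only that partition's cache entry.
    void onPartitionReleased(String resultId, int partitionIndex) {
        Map<Integer, byte[]> perPartition = blobs.get(resultId);
        if (perPartition == null) {
            return;
        }
        perPartition.remove(partitionIndex);
        if (perPartition.isEmpty()) {
            blobs.remove(resultId);
        }
    }

    // Stand-in for a TaskExecutor downloading the offloaded descriptors.
    Optional<byte[]> download(String resultId, int partitionIndex) {
        Map<Integer, byte[]> perPartition = blobs.get(resultId);
        if (perPartition == null) {
            return Optional.empty();
        }
        return Optional.ofNullable(perPartition.get(partitionIndex));
    }

    public static void main(String[] args) {
        PartitionLevelCacheSketch cache = new PartitionLevelCacheSketch();
        cache.cache("result-A", 0, new byte[] {1});
        cache.cache("result-A", 1, new byte[] {2});

        // Partition 0 is released; partition 1 stays downloadable.
        cache.onPartitionReleased("result-A", 0);
        System.out.println("partition 0 cached: " + cache.download("result-A", 0).isPresent());
        System.out.println("partition 1 cached: " + cache.download("result-A", 1).isPresent());
        // prints "partition 0 cached: false" then "partition 1 cached: true"
    }
}
```

With this keying, a consumer that is still being deployed keeps access to its
own cached descriptors even after a sibling partition has been released.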
--
This message was sent by Atlassian Jira
(v8.3.4#803005)