Zhilong Hong created FLINK-23833:
------------------------------------

             Summary: Cache of ShuffleDescriptors should be individually 
cleaned up
                 Key: FLINK-23833
                 URL: https://issues.apache.org/jira/browse/FLINK-23833
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Coordination
    Affects Versions: 1.14.0
            Reporter: Zhilong Hong
             Fix For: 1.14.0


In FLINK-23005, we introduced a cache of the compressed serialized value of
ShuffleDescriptors to improve deployment performance. To make sure the cache
doesn't live too long and become a burden for GC, it is cleaned up when the
partition is released or reset for a new execution. The implementation cleans
up the cache of the entire IntermediateResult, because at that time a
partition was released only when the entire IntermediateResult was released.

However, since FLINK-22017, a BLOCKING result partition can be consumed
individually. This also means that a result partition doesn't need to wait for
the other result partitions and can be released individually. After this
change, the following scenario is possible: when one result partition is
finished, the cache of the whole IntermediateResult on the blob is deleted,
while other result partitions of the same IntermediateResult are just being
deployed to TaskExecutors. When those TaskExecutors then try to download the
TaskDeploymentDescriptor (TDD) from the blob, they find that the blob has been
deleted and get stuck.

This bug only happens for jobs with POINTWISE BLOCKING edges, and only when
{{blob.offload.minsize}} is set to an extremely small value, since the
ShuffleDescriptors of POINTWISE BLOCKING edges are usually small and would
otherwise not be offloaded to the blob. To solve this issue, we just need to
clean up the cache of ShuffleDescriptors individually.
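
The difference between cleaning up the whole IntermediateResult and cleaning up
individually can be sketched as follows. This is only a minimal illustration,
not the actual Flink code: FakeBlobStore, ResultCache, clearAll and
clearPartition are hypothetical names, and keying the cache by partition index
is a simplifying assumption.

```java
import java.util.HashMap;
import java.util.Map;

public class ShuffleDescriptorCacheSketch {

    /** Hypothetical stand-in for the blob server: blob key -> offloaded bytes. */
    public static final class FakeBlobStore {
        private final Map<String, byte[]> blobs = new HashMap<>();

        public void put(String key, byte[] value) {
            blobs.put(key, value);
        }

        public void delete(String key) {
            blobs.remove(key);
        }

        public boolean contains(String key) {
            return blobs.containsKey(key);
        }
    }

    /** Hypothetical cache of offloaded ShuffleDescriptors for one IntermediateResult. */
    public static final class ResultCache {
        private final String resultId;
        private final FakeBlobStore blobStore;
        // Partition index -> key of the blob holding its serialized descriptors.
        private final Map<Integer, String> keyByPartition = new HashMap<>();

        public ResultCache(String resultId, FakeBlobStore blobStore) {
            this.resultId = resultId;
            this.blobStore = blobStore;
        }

        public void cache(int partition, byte[] serializedDescriptors) {
            String key = resultId + "#" + partition;
            blobStore.put(key, serializedDescriptors);
            keyByPartition.put(partition, key);
        }

        /** True if a TaskExecutor could still download the blob for this partition. */
        public boolean isAvailable(int partition) {
            String key = keyByPartition.get(partition);
            return key != null && blobStore.contains(key);
        }

        /** Problematic behavior: releasing one partition drops every cached blob. */
        public void clearAll() {
            for (String key : keyByPartition.values()) {
                blobStore.delete(key);
            }
            keyByPartition.clear();
        }

        /** Proposed behavior: drop only the blob of the released partition. */
        public void clearPartition(int partition) {
            String key = keyByPartition.remove(partition);
            if (key != null) {
                blobStore.delete(key);
            }
        }
    }

    public static void main(String[] args) {
        ResultCache cache = new ResultCache("result-0", new FakeBlobStore());
        cache.cache(0, new byte[] {1});
        cache.cache(1, new byte[] {2});

        // Partition 0 is released while the consumer of partition 1 is still deploying.
        cache.clearPartition(0);
        System.out.println("partition 1 still downloadable: " + cache.isAvailable(1));
    }
}
```

With clearAll() in place of clearPartition(0), the blob for partition 1 would
also be deleted, which corresponds to the stuck download described above.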



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
