xushiyan commented on code in PR #7039:
URL: https://github.com/apache/hudi/pull/7039#discussion_r1016732662
##########
hudi-common/src/main/java/org/apache/hudi/common/util/CommitUtils.java:
##########
@@ -44,6 +46,28 @@ public class CommitUtils {
private static final Logger LOG = LogManager.getLogger(CommitUtils.class);
private static final String NULL_SCHEMA_STR =
Schema.create(Schema.Type.NULL).toString();
+ public static transient ConcurrentHashMap<String, List<Integer>>
PERSISTED_RDD_IDS = new ConcurrentHashMap<>();
Review Comment:
This tightly couples with Spark-specific logic, and interacting with
Spark-internal state at this layer is my main concern. I'm also not in favor
of maintaining global state for distributed processing. Can we try tackling
this within the Spark client itself? At a high level, we basically want to
track the persisted RDDs' ids and filter the RDDs by the tracked ids when
unpersisting. All of this can happen within a client's lifecycle, so
theoretically it can be encapsulated well. Just throwing out some ideas; I
haven't verified this against the code myself.
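To make the suggestion concrete, here is a minimal sketch of the client-scoped alternative. The class name `PersistedRddTracker` and its methods are hypothetical (not Hudi or Spark APIs), and Spark itself is stubbed out: `drainTrackedIds` would receive the ids from something like `sc.getPersistentRDDs()` in a real client. The point is only that the tracked-id set lives inside one client instance rather than in a static map:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical client-scoped tracker: each write client owns one instance
// and unpersists only the RDDs it registered, within its own lifecycle.
public class PersistedRddTracker {

  // ids of RDDs persisted by this client instance only (thread-safe set)
  private final Set<Integer> persistedIds = ConcurrentHashMap.newKeySet();

  // call when the client persists an RDD
  public void trackPersisted(int rddId) {
    persistedIds.add(rddId);
  }

  // Filter the currently-persisted ids (e.g. keys of sc.getPersistentRDDs())
  // down to the ids this client tracked, removing them from the tracker so
  // a second call cannot unpersist them again.
  public List<Integer> drainTrackedIds(List<Integer> allPersistedIds) {
    List<Integer> toUnpersist = new ArrayList<>();
    for (Integer id : allPersistedIds) {
      if (persistedIds.remove(id)) {
        toUnpersist.add(id);
      }
    }
    return toUnpersist;
  }
}
```

Since the tracker is owned by the client and drained on close, no global state is needed and concurrent clients cannot unpersist each other's RDDs.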
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]