xushiyan commented on code in PR #7039:
URL: https://github.com/apache/hudi/pull/7039#discussion_r1016732662
##########
hudi-common/src/main/java/org/apache/hudi/common/util/CommitUtils.java:
##########
@@ -44,6 +46,28 @@ public class CommitUtils {
private static final Logger LOG = LogManager.getLogger(CommitUtils.class);
private static final String NULL_SCHEMA_STR =
Schema.create(Schema.Type.NULL).toString();
+ public static transient ConcurrentHashMap<String, List<Integer>>
PERSISTED_RDD_IDS = new ConcurrentHashMap<>();
Review Comment:
This tightly couples with Spark-specific logic, and interacting with
Spark-internal state at this layer is my main concern. I'm also not in favor
of maintaining global state for distributed processing. Can we try tackling
this within the Spark client itself? At a high level, we basically want to
track the persisted RDDs' ids and filter the RDDs by the tracked ids when
unpersisting. All of this can happen within a client's lifecycle, so
theoretically it can be encapsulated well. Just throwing out some ideas; I
haven't verified this against the code myself.
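To make the suggestion concrete, here is a minimal sketch of the client-scoped alternative. The class name `PersistedRddTracker` and its methods are hypothetical (not Hudi or Spark APIs), and Spark itself is stubbed out: `drainTrackedIds` would receive the ids from something like `sc.getPersistentRDDs()` in a real client. The point is only that the tracked-id set lives inside one client instance rather than in a static map:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical client-scoped tracker: each write client owns one instance
// and unpersists only the RDDs it registered, within its own lifecycle.
public class PersistedRddTracker {

  // ids of RDDs persisted by this client instance only (thread-safe set)
  private final Set<Integer> persistedIds = ConcurrentHashMap.newKeySet();

  // call when the client persists an RDD
  public void trackPersisted(int rddId) {
    persistedIds.add(rddId);
  }

  // Filter the currently-persisted ids (e.g. keys of sc.getPersistentRDDs())
  // down to the ids this client tracked, removing them from the tracker so
  // a second call cannot unpersist them again.
  public List<Integer> drainTrackedIds(List<Integer> allPersistedIds) {
    List<Integer> toUnpersist = new ArrayList<>();
    for (Integer id : allPersistedIds) {
      if (persistedIds.remove(id)) {
        toUnpersist.add(id);
      }
    }
    return toUnpersist;
  }
}
```

Since the tracker is owned by the client and drained on close, no global state is needed and concurrent clients cannot unpersist each other's RDDs.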
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]