Re: [PR] [HUDI-6129] Support rate limit for Spark streaming source [hudi]

via GitHub Wed, 13 Dec 2023 19:26:17 -0800


boneanxs commented on code in PR #10326:
URL: https://github.com/apache/hudi/pull/10326#discussion_r1426133454



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/IncrementalRelation.scala:
##########
@@ -148,41 +147,30 @@ class IncrementalRelation(val sqlContext: SQLContext,
       // if first commit in a table is an empty commit without schema, return empty RDD here
       sqlContext.sparkContext.emptyRDD[Row]
     } else {
-      val regularFileIdToFullPath = mutable.HashMap[String, String]()
-      var metaBootstrapFileIdToFullPath = mutable.HashMap[String, String]()
-
-      // create Replaced file group
-      val replacedTimeline = commitsTimelineToReturn.getCompletedReplaceTimeline
-      val replacedFile = replacedTimeline.getInstants.flatMap { instant =>
-        val replaceMetadata = HoodieReplaceCommitMetadata.
-          fromBytes(metaClient.getActiveTimeline.getInstantDetails(instant).get, classOf[HoodieReplaceCommitMetadata])
-        replaceMetadata.getPartitionToReplaceFileIds.entrySet().flatMap { entry =>
-          entry.getValue.map { e =>
-            val fullPath = FSUtils.getPartitionPath(basePath, entry.getKey).toString
-            (e, fullPath)
-          }
-        }
-      }.toMap
-
-      for (commit <- commitsToReturn) {
-        val metadata: HoodieCommitMetadata = HoodieCommitMetadata.fromBytes(commitTimeline.getInstantDetails(commit)
-          .get, classOf[HoodieCommitMetadata])
-
-        if (HoodieTimeline.METADATA_BOOTSTRAP_INSTANT_TS == commit.getTimestamp) {
-          metaBootstrapFileIdToFullPath ++= metadata.getFileIdAndFullPaths(basePath).toMap.filterNot { case (k, v) =>
-            replacedFile.contains(k) && v.startsWith(replacedFile(k))
-          }
-        } else {
-          regularFileIdToFullPath ++= metadata.getFileIdAndFullPaths(basePath).toMap.filterNot { case (k, v) =>
-            replacedFile.contains(k) && v.startsWith(replacedFile(k))

Review Comment:
   In this PR, I simply ignore `clustering`, `compact`, and `log_compact`
commits, and read all data from `replaceCommit` instants such as
`insert_overwrite` and `insert_overwrite_table`, together with the files they
overwrote, if those files fall within this range.
   From my perspective, `insert` and `insert_overwrite` are both changes that
downstream consumers should be told about, rather than filtering out data
written by an `insert` because a later `insert_overwrite` overwrote it.
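
To make the intended behaviour concrete, here is a minimal, self-contained sketch of the selection rule described above. It is not the code in this PR and it does not use any Hudi classes: `Instant`, `IncrementalFilterSketch`, `filesToRead`, and the file names are hypothetical, and the operation labels are simply the ones used in this comment.

```scala
// Hypothetical, simplified model of a timeline instant; not a Hudi class.
final case class Instant(timestamp: String, operation: String, writtenFiles: Seq[String])

object IncrementalFilterSketch {
  // Operations skipped by the incremental read, using the labels from the comment.
  private val ignoredOps: Set[String] = Set("clustering", "compact", "log_compact")

  /** Files an incremental reader would scan for the given instants.
    * Instants with an ignored operation are dropped; everything else is kept,
    * including files that a later insert_overwrite replaced. */
  def filesToRead(instants: Seq[Instant]): Seq[String] =
    instants
      .filterNot(i => ignoredOps.contains(i.operation))
      .flatMap(_.writtenFiles)
}

object IncrementalFilterSketchDemo {
  val instants: Seq[Instant] = Seq(
    Instant("001", "insert", Seq("f1_001.parquet")),
    Instant("002", "clustering", Seq("f2_002.parquet")),
    Instant("003", "insert_overwrite", Seq("f3_003.parquet"))
  )

  // The file written by the insert at 001 is still returned, even though
  // the insert_overwrite at 003 may have replaced it.
  val selected: Seq[String] = IncrementalFilterSketch.filesToRead(instants)
}
```

In this sketch the `clustering` instant contributes nothing, while both the `insert` and the `insert_overwrite` are surfaced to the consumer, which is the trade-off argued for above.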



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
