[GitHub] [hudi] danny0405 commented on a change in pull request #2319: [MINOR] Improve to only compute fileComparisonsRDD once when 'hoodie.…

GitBox Thu, 10 Dec 2020 23:47:47 -0800


danny0405 commented on a change in pull request #2319:
URL: https://github.com/apache/hudi/pull/2319#discussion_r540751159




##########
File path: 
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/bloom/SparkHoodieBloomIndex.java
##########
@@ -122,13 +122,15 @@ public SparkHoodieBloomIndex(HoodieWriteConfig config) {
 
     // Step 3: Obtain a RDD, for each incoming record, that already exists, with the file id,
     // that contains it.
+    JavaRDD<Tuple2<String, HoodieKey>> fileComparisonsRDD =

Review comment:
       I haven't checked the Spark UI yet; I only did a simple analysis of the 
data writing process. For each batch of records to write, 
`SparkHoodieBloomIndex.lookupIndex` is expected to be invoked once, so 
`fileComparisonsRDD` should be evaluated only once. Is there another 
invocation of `SparkHoodieBloomIndex.lookupIndex`? Maybe I missed something.
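
For context on the question above, here is a minimal plain-Java analogy (not Spark or Hudi code) of why a lazily defined dataset can be computed more than once even when the enclosing method is invoked only once: each consumer of an uncached lazy pipeline triggers the computation again. The class `LazyEvalSketch`, the helper `memoize`, and the sample data are hypothetical names introduced for illustration only. Whether `lookupIndex` actually consumes `fileComparisonsRDD` more than once is exactly what the comment is asking and is not asserted here.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;
import java.util.stream.Collectors;

public class LazyEvalSketch {

    // Wraps a supplier so the underlying computation runs at most once.
    // This is only a stand-in for the idea of caching a lazy dataset.
    static <T> Supplier<T> memoize(Supplier<T> source) {
        return new Supplier<T>() {
            private T value;
            private boolean computed;

            @Override
            public synchronized T get() {
                if (!computed) {
                    value = source.get();
                    computed = true;
                }
                return value;
            }
        };
    }

    // Runs two consumers over the same lazy pipeline and returns
    // how many times the pipeline was actually computed.
    static int evaluationsForTwoActions(boolean cached) {
        AtomicInteger evaluations = new AtomicInteger();

        // Hypothetical stand-in for a lazily defined set of file comparisons.
        Supplier<List<String>> comparisons = () -> {
            evaluations.incrementAndGet();
            return List.of("file1:key1", "file1:key2", "file2:key3");
        };
        Supplier<List<String>> pipeline = cached ? memoize(comparisons) : comparisons;

        // First consumer: count comparisons per file.
        pipeline.get().stream()
            .collect(Collectors.groupingBy(s -> s.split(":")[0], Collectors.counting()));

        // Second consumer: walk the comparisons again for the lookup itself.
        pipeline.get().forEach(s -> { });

        return evaluations.get();
    }

    public static void main(String[] args) {
        System.out.println("uncached evaluations: " + evaluationsForTwoActions(false));
        System.out.println("cached evaluations: " + evaluationsForTwoActions(true));
    }
}
```

In this sketch the uncached pipeline is computed twice and the memoized one once, although `evaluationsForTwoActions` itself is called a single time per case. The count that matters is the number of consumers inside one invocation, not the number of invocations.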




