vinothchandar commented on a change in pull request #2319:
URL: https://github.com/apache/hudi/pull/2319#discussion_r540767320



##########
File path: 
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/bloom/SparkHoodieBloomIndex.java
##########
@@ -122,13 +122,15 @@ public SparkHoodieBloomIndex(HoodieWriteConfig config) {
 
     // Step 3: Obtain a RDD, for each incoming record, that already exists, with the file id,
     // that contains it.
+    JavaRDD<Tuple2<String, HoodieKey>> fileComparisonsRDD =

Review comment:
      So Spark evaluates RDDs lazily. If an RDD is not persisted to disk/cache, Spark simply recomputes it whenever it is needed. In this case, `fileComparisonsRDD` would be recomputed twice at runtime. The method `explodeRecordRDDWithFileComparisons()` will only be called once, but in practice it does nothing except "define" the RDD it returns.
   
   In contrast, if you notice this line of code at the start of `tagLocation`
   
   ```
   // Step 0: cache the input record RDD
   if (config.getBloomIndexUseCaching()) {
     recordRDD.persist(SparkMemoryUtils.getBloomIndexInputStorageLevel(config.getProps()));
   }
   ```
   
   This caches the incoming `recordRDD`, so however many times this RDD is used in the indexing DAG, Spark will not go back to the previous stage. If we did not have the `.persist()` here, then every time an RDD derived from `recordRDD` is needed for a Spark action, Spark would keep reading from the source and recomputing `recordRDD`.
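   To make the concern concrete, here is a plain-Python sketch (not the Spark API) of lazy evaluation with and without a `persist()`-style cache. `read_source` and `collect` are hypothetical stand-ins for the source scan and a Spark action; the counter shows how many times the source is actually read:

   ```python
   from functools import lru_cache

   source_reads = {"count": 0}

   def read_source():
       # Stand-in for scanning the input records from storage.
       source_reads["count"] += 1
       return [1, 2, 3, 4]

   def record_rdd():
       # Like a Spark transformation: merely *defines* the computation.
       return [x * 2 for x in read_source()]

   def collect(dataset):
       # Stand-in for a Spark action, which forces evaluation.
       return dataset()

   # Two actions without persist(): the source is read twice.
   collect(record_rdd)
   collect(record_rdd)
   print(source_reads["count"])  # -> 2

   # A persist()-like memoization: the first action materializes the
   # result, and later actions reuse it instead of re-reading the source.
   persisted = lru_cache(maxsize=None)(record_rdd)
   collect(persisted)
   collect(persisted)
   print(source_reads["count"])  # -> 3 (only one more read)
   ```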
   
   Apologies if you knew all this already. :) I am failing to see how this won't happen here, but at least this explanation should convey my concern.
   
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
