vinothchandar commented on a change in pull request #2319:
URL: https://github.com/apache/hudi/pull/2319#discussion_r540767320
##########
File path:
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/bloom/SparkHoodieBloomIndex.java
##########
@@ -122,13 +122,15 @@ public SparkHoodieBloomIndex(HoodieWriteConfig config) {
// Step 3: Obtain a RDD, for each incoming record, that already exists, with the file id,
// that contains it.
+ JavaRDD<Tuple2<String, HoodieKey>> fileComparisonsRDD =
Review comment:
So Spark evaluates RDDs lazily. If an RDD is not persisted to disk or cached in
memory, Spark simply recomputes it on every action. In this case,
`fileComparisonsRDD` would be recomputed twice at runtime. The method
`explodeRecordRDDWithFileComparisons()` is only called once, but calling it does
nothing in practice except "define" the RDD it returns.
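To illustrate what "define but don't compute" means, here is a plain-Java analogy (not the Spark API): a `Supplier` chain stands in for the RDD lineage, and each `get()` call stands in for an action. The names `source` and `derived` are hypothetical; the point is that wiring up the chain does no work, and every action re-walks the whole lineage.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class LazyRecompute {
    public static void main(String[] args) {
        AtomicInteger sourceReads = new AtomicInteger();

        // "Defining" the pipeline does no work, like building an RDD lineage.
        Supplier<Integer> source = () -> {
            sourceReads.incrementAndGet(); // counts how often the "source" is read
            return 21;
        };
        Supplier<Integer> derived = () -> source.get() * 2; // a derived "RDD"

        System.out.println(sourceReads.get()); // nothing computed yet

        // Each "action" walks the whole lineage again.
        derived.get();
        derived.get();
        System.out.println(sourceReads.get()); // source recomputed once per action
    }
}
```

Here two actions on `derived` trigger two reads of `source`, just as two Spark actions over an unpersisted `fileComparisonsRDD` would recompute it twice.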
In contrast, if you notice this line of code at the start of `tagLocation`
```
// Step 0: cache the input record RDD
if (config.getBloomIndexUseCaching()) {
recordRDD.persist(SparkMemoryUtils.getBloomIndexInputStorageLevel(config.getProps()));
}
```
This caches the incoming `recordRDD`, so however many times this RDD is used in
the indexing DAG, Spark does not have to go back to the previous stage. If we
did not have the `.persist()` here, then every time an RDD derived from
`recordRDD` is needed for a Spark action, Spark would keep reading from the
source and computing `recordRDD` again.
Apologies if you knew all this already. :) I am failing to see how this won't
happen here, but at least this explanation should convey my concern.
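Continuing the plain-Java analogy (again, not the Spark API), a memoizing wrapper plays the role of `.persist()`: the upstream value is computed once and reused by every later "action". The `memoize` helper and the supplier names are hypothetical stand-ins.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class CachedRecompute {
    public static void main(String[] args) {
        AtomicInteger sourceReads = new AtomicInteger();
        Supplier<Integer> source = () -> {
            sourceReads.incrementAndGet(); // counts reads of the "source"
            return 21;
        };

        // Crude stand-in for recordRDD.persist(...): compute once, reuse afterwards.
        Supplier<Integer> cached = memoize(source);
        Supplier<Integer> derived = () -> cached.get() * 2;

        // Two "actions", but the upstream work happens only once.
        derived.get();
        derived.get();
        System.out.println(sourceReads.get());
    }

    // Minimal memoization: first call computes, later calls return the stored value.
    static <T> Supplier<T> memoize(Supplier<T> s) {
        return new Supplier<T>() {
            private T value;
            private boolean done;
            public synchronized T get() {
                if (!done) {
                    value = s.get();
                    done = true;
                }
                return value;
            }
        };
    }
}
```

With the cache in place the counter stays at 1 after two actions, which is exactly what persisting `recordRDD` buys the indexing DAG.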
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]