[GitHub] [hudi] vinothchandar commented on a change in pull request #2319: [MINOR] Improve to only compute fileComparisonsRDD once when 'hoodie.…

GitBox Thu, 10 Dec 2020 23:22:08 -0800


vinothchandar commented on a change in pull request #2319:
URL: https://github.com/apache/hudi/pull/2319#discussion_r540738854




##########
File path: 
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/bloom/SparkHoodieBloomIndex.java
##########
@@ -122,13 +122,15 @@ public SparkHoodieBloomIndex(HoodieWriteConfig config) {
 
     // Step 3: Obtain a RDD, for each incoming record, that already exists, with the file id,
     // that contains it.
+    JavaRDD<Tuple2<String, HoodieKey>> fileComparisonsRDD =

Review comment:
       Do you actually see in the Spark UI that it's not computed twice? I ask 
because `fileComparisonsRDD` is not cached, so even though it is 
declared only once, at runtime Spark will lazily recompute 
`fileComparisonsRDD` once for each action that consumes it. 
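
To illustrate the concern, here is a minimal plain-Java analogy, not Spark code and not taken from the PR. A lazily evaluated `Supplier` stands in for `fileComparisonsRDD`, and the hypothetical `memoize` helper plays the role that `persist()` or `cache()` plays on a `JavaRDD`. The names `RecomputeDemo`, `memoize`, and `countEvaluations` are assumptions made up for this sketch.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;
import java.util.stream.Collectors;

// Plain-Java analogy only; no Spark classes are involved.
class RecomputeDemo {

  // Wraps a supplier so its result is computed at most once
  // (roughly what persisting an RDD achieves).
  static <T> Supplier<T> memoize(Supplier<T> source) {
    return new Supplier<T>() {
      private T value;
      private boolean computed;

      @Override
      public synchronized T get() {
        if (!computed) {
          value = source.get();
          computed = true;
        }
        return value;
      }
    };
  }

  // Runs two independent consumers against one lazy pipeline and returns
  // how many times the expensive step was actually evaluated.
  static int countEvaluations(boolean cached) {
    AtomicInteger evaluations = new AtomicInteger();

    // Stand-in for fileComparisonsRDD: declared once, evaluated lazily.
    Supplier<List<String>> fileComparisons = () -> {
      evaluations.incrementAndGet();
      return Arrays.asList("file1:key1", "file1:key2", "file2:key3");
    };
    Supplier<List<String>> pipeline = cached ? memoize(fileComparisons) : fileComparisons;

    // Consumer 1: something like a count over the comparisons.
    int total = pipeline.get().size();

    // Consumer 2: something like the actual key lookup.
    List<String> file1Only = pipeline.get().stream()
        .filter(s -> s.startsWith("file1:"))
        .collect(Collectors.toList());

    return evaluations.get();
  }

  public static void main(String[] args) {
    System.out.println("uncached evaluations: " + countEvaluations(false)); // prints 2
    System.out.println("cached evaluations:   " + countEvaluations(true));  // prints 1
  }
}
```

Declaring the variable once does not change the uncached count. Only the memoized variant evaluates the expensive step a single time, which is why checking the Spark UI for repeated stages is a reasonable way to confirm the behavior.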
   
   

##########
File path: 
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/bloom/SparkHoodieBloomIndex.java
##########
@@ -252,11 +252,10 @@ public boolean isImplicitWithStorage() {
    * Make sure the parallelism is atleast the groupby parallelism for tagging location
    */
   JavaPairRDD<HoodieKey, HoodieRecordLocation> findMatchingFilesForRecordKeys(
-      final Map<String, List<BloomIndexFileInfo>> partitionToFileIndexInfo,
-      JavaPairRDD<String, String> partitionRecordKeyPairRDD, int shuffleParallelism, HoodieTable hoodieTable,
+      JavaRDD<Tuple2<String, HoodieKey>> fileComparisonsRDD,
+      int shuffleParallelism,
+         HoodieTable hoodieTable,

Review comment:
       nit: wondering how checkstyle is happy with the indentation here. :) 



