boneanxs commented on code in PR #7374:
URL: https://github.com/apache/hudi/pull/7374#discussion_r1040342023


##########
hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala:
##########
@@ -150,8 +151,8 @@ object HoodieDatasetBulkInsertHelper extends Logging {
       }
 
       writer.getWriteStatuses.asScala.map(_.toWriteStatus).iterator
-    }).collect()
-    table.getContext.parallelize(writeStatuses.toList.asJava)

Review Comment:
   Yes, we can fix this by directly using `getStat`, but what if `updateIndex` 
ends up computing `writeStatusList` multiple times? If we directly dereference 
the `RDD<WriteStatus>` into a list of `WriteStatus` at one suitable point (as 
`performClusteringWithRecordsAsRow` already does), we no longer need to worry 
about this issue at all.
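
   To illustrate the recomputation hazard outside of Spark, here is a minimal plain-Java sketch (not Hudi code; `lazyWriteStatuses` is a hypothetical stand-in for a lazy `RDD<WriteStatus>` whose upstream work re-runs on every dereference):

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class MaterializeOnce {
    // Counts how many times the "upstream" computation actually runs.
    static final AtomicInteger recomputes = new AtomicInteger();

    // Analogue of a lazy RDD<WriteStatus>: each dereference re-runs the upstream work.
    static final Supplier<List<Integer>> lazyWriteStatuses = () -> {
        recomputes.incrementAndGet();
        return IntStream.range(0, 5).boxed().collect(Collectors.toList());
    };

    static int consumeLazilyTwice() {
        recomputes.set(0);
        lazyWriteStatuses.get().size();   // e.g. updateIndex walks the statuses
        lazyWriteStatuses.get().size();   // e.g. commit-stat collection walks them again
        return recomputes.get();          // upstream ran twice
    }

    static int consumeMaterializedTwice() {
        recomputes.set(0);
        List<Integer> materialized = lazyWriteStatuses.get(); // dereference once
        materialized.size();              // later consumers reuse the list
        materialized.size();
        return recomputes.get();          // upstream ran once
    }

    public static void main(String[] args) {
        System.out.println("lazy: " + consumeLazilyTwice()
                + ", materialized: " + consumeMaterializedTwice());
    }
}
```

Materializing at one point plays the role of `collect()` in the diff above: downstream consumers then share the concrete list instead of each re-triggering the computation.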
   
   As for the thread-pool parallelism causing a performance issue, I think 
`performClusteringWithRecordsRDD` has the same problem. Since we may call 
`partitioner.repartitionRecords`, a new job could also be launched inside the 
Future thread, e.g. 
https://github.com/apache/hudi/blob/ea48a85efcf8e331d0cc105d426e830b8bfe5b37/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RDDSpatialCurveSortPartitioner.java#L66 
(checking whether the RDD is empty), or the `sortBy` call in 
`RDDCustomColumnsSortPartitioner` (`sortBy` uses `RangePartitioner`, which first 
samples the RDD to decide the ranges, so it also launches a job inside the 
Future).
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
