boneanxs commented on code in PR #7374:
URL: https://github.com/apache/hudi/pull/7374#discussion_r1040342023
##########
hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala:
##########
@@ -150,8 +151,8 @@ object HoodieDatasetBulkInsertHelper extends Logging {
}
writer.getWriteStatuses.asScala.map(_.toWriteStatus).iterator
- }).collect()
- table.getContext.parallelize(writeStatuses.toList.asJava)
Review Comment:
Yes, we can fix this by directly using `getStat`, but what if `updateIndex`
computes `writeStatusList` multiple times? If we dereference the
`RDD<WriteStatus>` into a list of `WriteStatus` at one suitable point (as
`performClusteringWithRecordsAsRow` already does), we no longer need to worry
about this issue.
As for the concern that the thread pool's parallelism could cause a
performance issue, I think `performClusteringWithRecordsRDD` has the same
problem. Since we may call `partitioner.repartitionRecords`, a new Spark job
can also be launched inside the Future thread, e.g.
https://github.com/apache/hudi/blob/ea48a85efcf8e331d0cc105d426e830b8bfe5b37/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RDDSpatialCurveSortPartitioner.java#L66
(checking whether the RDD is empty), or the `sortBy` call in
`RDDCustomColumnsSortPartitioner` (`sortBy` uses `RangePartitioner`, which
first samples the RDD to determine the range boundaries, so it will also
launch a job inside the Future).
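To illustrate the point above: a minimal sketch of how an RDD operation inside a Future can submit a Spark job before any explicit action runs. This assumes an existing `SparkSession` named `spark`; the variable names are hypothetical and this is not code from the PR.

```scala
import scala.concurrent.{ExecutionContext, Future}

implicit val ec: ExecutionContext = ExecutionContext.global

// Although sortBy looks like a lazy transformation, constructing its
// RangePartitioner eagerly samples the RDD to compute range boundaries,
// so a Spark job is submitted right here, on the Future's thread.
val sortedInFuture = Future {
  val rdd = spark.sparkContext.parallelize(1 to 1000000)
  rdd.sortBy(identity) // RangePartitioner sampling job runs at this point
}
```

The same applies to `rdd.isEmpty` (it internally runs `take(1)`, which is an action), which is why jobs can pile up inside the clustering thread pool.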
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]