boneanxs commented on code in PR #7374:
URL: https://github.com/apache/hudi/pull/7374#discussion_r1039436948
##########
hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala:
##########
@@ -150,8 +151,8 @@ object HoodieDatasetBulkInsertHelper extends Logging {
}
writer.getWriteStatuses.asScala.map(_.toWriteStatus).iterator
- }).collect()
- table.getContext.parallelize(writeStatuses.toList.asJava)
Review Comment:
Hey @Zouxxyy, Thanks for raising this issue!
The reason to collect the data here is that the `HoodieData<WriteStatus>` will
be used multiple times after `performClustering`. I recall there is an
`isEmpty` check that could take a lot of time, so here we directly convert to a
list of `WriteStatus`, which avoids recomputing the RDD and reduces the time.
For the second issue, I noticed this and raised a PR to fix it: #7343. Will
that address your problem? Feel free to review it!
I think `performClusteringWithRecordsRDD` has the same issue: for example,
when using `RDDSpatialCurveSortPartitioner` to optimize the data layout, it
will call `RDD.isEmpty`, which triggers a new Spark job.
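To illustrate the trade-off discussed above, here is a minimal Scala sketch with no Spark dependency: a lazy `view` stands in for an uncached `RDD` (each action re-runs the whole lineage), and `.toList` plays the role of `collect()`, materializing the results once so later checks like `isEmpty` are free. The object and method names are hypothetical, purely for illustration.

```scala
// Sketch only: a Scala view simulates an uncached RDD, whose map function
// re-runs on every traversal; .toList simulates collect(), which
// materializes the results once so repeated uses cost nothing extra.
object RecomputeSketch {
  // Two traversals of a lazy view: the "lineage" runs twice.
  def lazyEvalCount(): Int = {
    var evals = 0
    val statuses = (1 to 3).view.map { i => evals += 1; i * 2 }
    statuses.sum // first "action" evaluates all 3 elements
    statuses.sum // second "action" evaluates all 3 again
    evals        // 6: every action recomputed the map function
  }

  // Same two traversals after materializing once, like collect() in the PR.
  def materializedEvalCount(): Int = {
    var evals = 0
    val statuses = (1 to 3).view.map { i => evals += 1; i * 2 }.toList
    statuses.sum // reuses the materialized list
    statuses.sum // no recomputation
    evals        // 3: the map function ran exactly once per element
  }
}
```

The same reasoning applies to `RDD.isEmpty`: on an unmaterialized RDD it is an action and so launches a job, whereas on a collected list it is a constant-time check.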
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]