Zouxxyy commented on code in PR #7374:
URL: https://github.com/apache/hudi/pull/7374#discussion_r1038815845
##########
hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala:
##########
@@ -150,8 +151,8 @@ object HoodieDatasetBulkInsertHelper extends Logging {
}
writer.getWriteStatuses.asScala.map(_.toWriteStatus).iterator
- }).collect()
- table.getContext.parallelize(writeStatuses.toList.asJava)
Review Comment:
@alexeykudinkin
Currently, `collect` is used internally in bulk insert for `Dataset` when
executing clustering, which causes two problems:
1. A separate Spark job is generated for each clustering group, so when there
are many clustering groups, too many Spark jobs are generated, which clutters
the Spark app
2. Because no executor is explicitly specified when submitting Spark jobs
through `CompletableFuture.supplyAsync`, the number of Spark jobs that can
run concurrently is limited to the number of driver CPU cores, which may
become a performance bottleneck
See https://issues.apache.org/jira/browse/HUDI-5327, where I described the
case I encountered.
cc @boneanxs
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]