Zouxxyy commented on code in PR #7374:
URL: https://github.com/apache/hudi/pull/7374#discussion_r1038815845


##########
hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala:
##########
@@ -150,8 +151,8 @@ object HoodieDatasetBulkInsertHelper extends Logging {
       }
 
       writer.getWriteStatuses.asScala.map(_.toWriteStatus).iterator
-    }).collect()
-    table.getContext.parallelize(writeStatuses.toList.asJava)

Review Comment:
   @alexeykudinkin 
   
   Currently, `collect` is used internally in bulk insert for `Dataset` when
executing clustering, which causes two problems:
   
   1. A single Spark job is generated within it, so if there are many clustering
groups, too many Spark jobs will be generated, which clutters the Spark
application and makes it harder to monitor
   2. Because an Executor is not explicitly specified when submitting Spark jobs
through `CompletableFuture.supplyAsync`, the number of Spark jobs that can run
concurrently is limited by the number of CPU cores on the driver, which may
become a performance bottleneck
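   To illustrate point 2: `CompletableFuture.supplyAsync` without an explicit
executor runs tasks on `ForkJoinPool.commonPool()`, whose default parallelism is
tied to the machine's core count, while passing an explicit executor lets the
caller choose the pool size. The sketch below is a minimal standalone
demonstration (the pool size of 64 is an arbitrary illustrative value, not a
recommendation from this PR):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ForkJoinPool;

public class SupplyAsyncDemo {
    public static void main(String[] args) throws Exception {
        // Without an executor argument, supplyAsync uses the common
        // ForkJoinPool. Its default parallelism is derived from
        // availableProcessors(), so concurrently submitted (blocking)
        // tasks are effectively capped by the driver's CPU core count.
        int commonParallelism = ForkJoinPool.commonPool().getParallelism();
        System.out.println("common pool parallelism = " + commonParallelism);

        CompletableFuture<String> implicitPool =
            CompletableFuture.supplyAsync(() -> Thread.currentThread().getName());

        // With an explicit executor, the caller controls the pool size.
        // It can exceed the core count, which suits job submission threads
        // that mostly block waiting on the cluster rather than use driver CPU.
        ExecutorService pool = Executors.newFixedThreadPool(64);
        CompletableFuture<String> explicitPool =
            CompletableFuture.supplyAsync(() -> Thread.currentThread().getName(), pool);

        System.out.println("implicit pool thread: " + implicitPool.get());
        System.out.println("explicit pool thread: " + explicitPool.get());
        pool.shutdown();
    }
}
```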
   
   In addition, `performClusteringWithRecordsRDD` does not have the above
problems because it does not use `collect` internally, so I am just keeping
their behavior consistent
   
   See https://issues.apache.org/jira/browse/HUDI-5327, where I described the
case I encountered
   
   cc @boneanxs



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
