boneanxs commented on code in PR #7374:
URL: https://github.com/apache/hudi/pull/7374#discussion_r1039436948


##########
hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala:
##########
@@ -150,8 +151,8 @@ object HoodieDatasetBulkInsertHelper extends Logging {
       }
 
       writer.getWriteStatuses.asScala.map(_.toWriteStatus).iterator
-    }).collect()
-    table.getContext.parallelize(writeStatuses.toList.asJava)

Review Comment:
   Hey @Zouxxyy, thanks for raising this issue! It's great to see you trying this 
feature!
   
   The reason we collect the data here is that the `HoodieData<WriteStatus>` will 
be used multiple times after `performClustering`. I recall there is an `isEmpty` 
check that could take a lot of time, so here we convert the result directly to a 
list of `WriteStatus`, which avoids recomputing it and reduces the time.
   
   For the second issue, I noticed this and raised a PR to fix it: #7343. Will 
that address your problem? Feel free to review it!
   
   I think `performClusteringWithRecordsRDD` has the same issue: for example, when 
using `RDDSpatialCurveSortPartitioner` to optimize the data layout, it calls 
`RDD.isEmpty`, which triggers a new Spark job.
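   To make the trade-off concrete, here is a minimal sketch (hypothetical names, 
not the actual Hudi code) of why repeated actions on an uncached RDD are costly 
and why a one-time `collect()` sidesteps that. It assumes a local Spark session 
is available:
   
   ```scala
   // Illustrative sketch only; `statuses` stands in for the write-status RDD.
   import org.apache.spark.sql.SparkSession
   
   object IsEmptyCostDemo {
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder()
         .master("local[1]").appName("isEmpty-cost-demo").getOrCreate()
       val sc = spark.sparkContext
   
       // Stand-in for the expensive clustering write path.
       val statuses = sc.parallelize(1 to 4).map(i => s"status-$i")
   
       // Without caching or collecting, each action re-runs the lineage:
       statuses.isEmpty() // launches job 1
       statuses.count()   // launches job 2, recomputing the map above
   
       // Collecting once materializes the results on the driver, so
       // later emptiness/size checks are plain local operations:
       val collected = statuses.collect().toList
       assert(collected.nonEmpty) // no extra Spark job
   
       spark.stop()
     }
   }
   ```
   
   Caching (`persist()`) would also avoid the recomputation while keeping the data 
distributed; collecting to a list is the simpler option here since the statuses 
are reused several times on the driver afterwards.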


