vinothchandar commented on PR #5470: URL: https://github.com/apache/hudi/pull/5470#issuecomment-1194344129
@alexeykudinkin Want to get my understanding straight, as well as make sure we have an explanation for how these factors play out with the new changes.

1. The original row writer impl incurred overhead from doing `df.queryExecution.toRdd` [here](https://github.com/apache/hudi/blob/622d27a099f5dec96f992fd423b666083da4b24a/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieSparkUtils.scala#L160), done before Avro record conversion. We traced this to code in Spark that makes (almost) an additional pass to materialize the Rows with a schema to be used by the iterator.
2. I see that in 0.11.1 we were just [processing](https://github.com/apache/hudi/blob/622d27a099f5dec96f992fd423b666083da4b24a/hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/HoodieDatasetBulkInsertHelper.java#L74) the dataframe as `Dataset<Row>`, and ergo the use of UDFs for the other functionality. This is what's been fixed in 0.12 now.

I want to understand how we are avoiding the RDD conversion costs in the current approach. This cost becomes obvious when you write records with a large number of columns (due to the overhead per record).
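For context, the UDF-vs-expression distinction in point 2 can be sketched roughly as follows. This is a hypothetical minimal example, not Hudi's actual key-generation code; the column names `a`, `b` and the helper names are illustrative, and `_hoodie_record_key` is used only as a stand-in for the meta column being populated:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, concat_ws, udf}

// 0.11.x style (sketch): a JVM UDF. To invoke it, Spark must deserialize
// each record out of its internal (Tungsten) format into JVM objects --
// a per-record conversion cost that grows with the number of columns.
val recordKeyUdf = udf { (a: String, b: String) => s"$a:$b" }
def withKeyViaUdf(df: DataFrame): DataFrame =
  df.withColumn("_hoodie_record_key", recordKeyUdf(col("a"), col("b")))

// 0.12 style (sketch): the same logic as a Catalyst expression, which
// Spark can code-generate directly against InternalRow, so no Row
// materialization pass is needed for this step.
def withKeyViaExpr(df: DataFrame): DataFrame =
  df.withColumn("_hoodie_record_key", concat_ws(":", col("a"), col("b")))
```

Note this sketch only covers the UDF elimination; whether the `queryExecution.toRdd` hop itself (point 1) is also avoided is exactly the question being asked.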
