vinothchandar commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1194344129

   @alexeykudinkin Want to get my understanding straight, as well as make sure we have an explanation for how these factors play out with the new changes.
   
   
   1. The original row writer impl incurred overhead from calling `df.queryExecution.toRdd` [here](https://github.com/apache/hudi/blob/622d27a099f5dec96f992fd423b666083da4b24a/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieSparkUtils.scala#L160), done before the Avro record conversion. We traced this to code in Spark that makes (almost) an additional pass over the data to materialize the Rows with a schema, for use by the iterator.
   
   2. I see that in 0.11.1 we were just [processing](https://github.com/apache/hudi/blob/622d27a099f5dec96f992fd423b666083da4b24a/hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/HoodieDatasetBulkInsertHelper.java#L74) the dataframe as a `Dataset<Row>`, hence the use of UDFs for the other functionality. This is what's been fixed in 0.12 now.
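   As a rough, Spark-free analogy for why stacking UDFs on a `Dataset<Row>` is costly (toy Python, not the actual Hudi code; the meta column names are just illustrative): each UDF is an opaque per-record function that the optimizer cannot fuse with neighboring expressions, so each added UDF behaves roughly like another traversal, whereas a single fused projection touches every record once.

   ```python
   # Toy analogy (not Spark code): chained per-column "UDF" passes vs one
   # fused projection computing the same derived columns.

   records = [{"key": f"k{i}", "ts": i} for i in range(5)]

   # UDF-style: each derived column is its own full pass over all records.
   def add_column(rows, name, fn):
       return [{**r, name: fn(r)} for r in rows]

   staged = add_column(records, "_hoodie_record_key", lambda r: r["key"])
   staged = add_column(staged, "_hoodie_partition_path", lambda r: str(r["ts"] % 2))

   # Fused: one pass computes both derived columns per record.
   fused = [
       {**r,
        "_hoodie_record_key": r["key"],
        "_hoodie_partition_path": str(r["ts"] % 2)}
       for r in records
   ]

   assert staged == fused  # same result, but fewer traversals in the fused form
   ```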
   
   
   I want to understand how we are avoiding the RDD conversion costs in the current approach. This cost becomes pronounced for records with a large number of columns (due to the per-record conversion overhead).
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
