vinothchandar commented on PR #5470: URL: https://github.com/apache/hudi/pull/5470#issuecomment-1194344129
@alexeykudinkin Want to get my understanding straight, as well as make sure we have an explanation for how these factors play out with the new changes.

1. The original row writer impl incurred overhead from doing `df.queryExecution.toRdd` [here](https://github.com/apache/hudi/blob/622d27a099f5dec96f992fd423b666083da4b24a/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieSparkUtils.scala#L160), done before Avro record conversion. We traced this to code in Spark that makes (almost) an additional pass to materialize the Rows with a schema to be used by the iterator.
2. I see that in 0.11.1 we were just [processing](https://github.com/apache/hudi/blob/622d27a099f5dec96f992fd423b666083da4b24a/hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/HoodieDatasetBulkInsertHelper.java#L74) the dataframe as `Dataset<Row>`, and ergo the use of UDFs for the other functionality. This is what's been fixed in 0.12 now.

I want to understand how we are avoiding the RDD conversion costs in the current approach. This cost becomes obvious when you write records with a large number of columns (due to the overhead per record).
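For context, the UDF-vs-expression distinction in point 2 can be sketched roughly as follows. This is a hypothetical minimal example, not Hudi's actual key-generation code; the column names `a`, `b` and the helper names are illustrative, and `_hoodie_record_key` is used only as a stand-in for the meta column being populated:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, concat_ws, udf}

// 0.11.x style (sketch): a JVM UDF. To invoke it, Spark must deserialize
// each record out of its internal (Tungsten) format into JVM objects --
// a per-record conversion cost that grows with the number of columns.
val recordKeyUdf = udf { (a: String, b: String) => s"$a:$b" }
def withKeyViaUdf(df: DataFrame): DataFrame =
  df.withColumn("_hoodie_record_key", recordKeyUdf(col("a"), col("b")))

// 0.12 style (sketch): the same logic as a Catalyst expression, which
// Spark can code-generate directly against InternalRow, so no Row
// materialization pass is needed for this step.
def withKeyViaExpr(df: DataFrame): DataFrame =
  df.withColumn("_hoodie_record_key", concat_ws(":", col("a"), col("b")))
```

Note this sketch only covers the UDF elimination; whether the `queryExecution.toRdd` hop itself (point 1) is also avoided is exactly the question being asked.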
