[
https://issues.apache.org/jira/browse/HUDI-4081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17571108#comment-17571108
]
Alexey Kudinkin commented on HUDI-4081:
---------------------------------------
Turns out most of the gap between these (~20%) is attributable to
inadvertently dereferencing the Dataset into RDD[Row], which incurs the
cost of deserializing every row. You can see that in the plans below:
Before:
!Screen Shot 2022-07-25 at 10.04.37 AM.png|width=267,height=457!
After:
!Screen Shot 2022-07-25 at 10.05.00 AM.png|width=256,height=273!
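For readers without access to the screenshots, the effect described above can be sketched roughly as follows. This is only an illustration, not the actual Hudi change: the input DataFrame, the object name, and the row count are made up. It relies on `Dataset.rdd` having to decode Spark's internal row format into external `Row` objects, whereas `queryExecution.toRdd` (a developer-facing API whose stability is not guaranteed) exposes the rows as `InternalRow` without that decoding step.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.catalyst.InternalRow

// Hypothetical, self-contained sketch; not taken from the Hudi code base.
object RddDereferenceSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("rdd-dereference-sketch")
      .getOrCreate()

    // Made-up input standing in for the Dataset handed to the write path.
    val df = spark.range(0L, 1000000L)
      .selectExpr("id", "cast(id as string) as value")

    // Dereferencing into RDD[Row]: each internal row is deserialized
    // into an external Row object before user code sees it.
    val externalRows: RDD[Row] = df.rdd
    println(s"rows via df.rdd: ${externalRows.count()}")

    // Staying on Spark's internal representation: rows remain InternalRow,
    // so the per-row deserialization is skipped.
    val internalRows: RDD[InternalRow] = df.queryExecution.toRdd
    println(s"rows via queryExecution.toRdd: ${internalRows.count()}")

    spark.stop()
  }
}
```

Whether the difference is measurable depends on schema width and row count; the ~20% figure above comes from the internal benchmarks, not from this sketch.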
> Evaluate Spark SQL vs DS performance
> ------------------------------------
>
> Key: HUDI-4081
> URL: https://issues.apache.org/jira/browse/HUDI-4081
> Project: Apache Hudi
> Issue Type: Task
> Components: spark-sql
> Reporter: Ethan Guo
> Assignee: Alexey Kudinkin
> Priority: Blocker
> Fix For: 0.12.0
>
> Attachments: Screen Shot 2022-07-25 at 10.04.37 AM.png, Screen Shot
> 2022-07-25 at 10.05.00 AM.png
>
>
> In our internal benchmarks we've detected a regression in Spark SQL relative
> to Spark DataSource integration.
> We need to investigate and subsequently address that.