[jira] [Commented] (HUDI-4081) Evaluate Spark SQL vs DS performance

Alexey Kudinkin (Jira) Mon, 25 Jul 2022 14:39:06 -0700


    [ 
https://issues.apache.org/jira/browse/HUDI-4081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17571108#comment-17571108
 ]


Alexey Kudinkin commented on HUDI-4081:
---------------------------------------

It turns out that most of the gap between the two (~20%) is attributable to 
an inadvertent conversion of the Dataset into an RDD[Row], which incurs the 
penalty of deserializing every row. You can see this in the plans below:

 

Before:

!Screen Shot 2022-07-25 at 10.04.37 AM.png|width=267,height=457!

 

After:

!Screen Shot 2022-07-25 at 10.05.00 AM.png|width=256,height=273!
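
For readers without access to the screenshots, here is a minimal sketch of 
the general pattern described above. It is not the actual Hudi code path; 
the object name, the input data and the filter are hypothetical and serve 
only to contrast going through `Dataset.rdd` with staying in the Dataset 
API. It assumes a local Spark session is available.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical illustration only; not taken from the Hudi code base.
object DatasetVsRddSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataset-vs-rdd-sketch")
      .master("local[*]")
      .getOrCreate()

    // Made-up input: one million rows with columns `id` and `value`.
    val df = spark.range(0, 1000000).withColumn("value", col("id") * 2)

    // Variant 1: `df.rdd` yields an RDD[Row]. Spark has to convert each
    // internal (binary) row into an external Row object before the
    // lambda can run, so every row pays a deserialization cost.
    val viaRdd = df.rdd.filter((row: Row) => row.getLong(1) % 4 == 0)
    println(s"rows kept via RDD[Row]: ${viaRdd.count()}")

    // Variant 2: the same filter expressed as a Column expression.
    // The plan stays on Spark's internal row format end to end.
    val viaDataset = df.filter(col("value") % 4 === 0)
    viaDataset.explain() // prints the physical plan for inspection
    println(s"rows kept via Dataset: ${viaDataset.count()}")

    spark.stop()
  }
}
```

In the Spark UI, the first variant would typically show an extra 
deserialization step in the plan (a `DeserializeToObject` node, as far as 
I can tell), which is presumably the kind of difference the "Before" and 
"After" plans illustrate.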

> Evaluate Spark SQL vs DS performance
> ------------------------------------
>
>                 Key: HUDI-4081
>                 URL: https://issues.apache.org/jira/browse/HUDI-4081
>             Project: Apache Hudi
>          Issue Type: Task
>          Components: spark-sql
>            Reporter: Ethan Guo
>            Assignee: Alexey Kudinkin
>            Priority: Blocker
>             Fix For: 0.12.0
>
>         Attachments: Screen Shot 2022-07-25 at 10.04.37 AM.png, Screen Shot 
> 2022-07-25 at 10.05.00 AM.png
>
>
> In our internal benchmarks we've detected a regression in Spark SQL relative 
> to Spark DataSource integration.
> We need to investigate and subsequently address that.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
