[
https://issues.apache.org/jira/browse/HUDI-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17245964#comment-17245964
]
Christopher Dedels commented on HUDI-1425:
------------------------------------------
+1 for this issue. In addition to isEmpty adding unnecessary time to our Spark
applications, we've discovered that it can actually cause the application to
crash. This is because isEmpty runs in a single task. For applications that
perform a very complex transformation before the merge, that single task can
easily overwhelm its executor and cause OOM and/or disk-space issues.
We are attempting to cache the DataFrame prior to the merge as a workaround.
Our hope is that this lets the entire cluster compute the cached DataFrame,
reducing the work left for the isEmpty test.
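The workaround described above can be sketched as follows. This is an illustrative sketch only, not Hudi code: the names (writeIfNonEmpty, transformed, and the table path) are hypothetical, and it assumes a Spark 2.4+ Dataset/DataFrame API where isEmpty is available.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.storage.StorageLevel

// Hypothetical sketch: persist the DataFrame before the emptiness check so
// the whole cluster materializes the complex lineage once, instead of a
// single isEmpty task re-computing it on one executor.
def writeIfNonEmpty(transformed: DataFrame): Unit = {
  val cached = transformed.persist(StorageLevel.MEMORY_AND_DISK)
  try {
    // count() forces materialization across all partitions in parallel;
    // the subsequent isEmpty then reads from the cached data.
    cached.count()
    if (!cached.isEmpty) {
      cached.write
        .format("hudi")
        .mode("append")
        .save("/tmp/hudi_table") // illustrative path
    }
  } finally {
    cached.unpersist()
  }
}
```

Note that persist alone is lazy; some explicit action (count here) is needed before isEmpty if the goal is to have the cluster, rather than the single isEmpty task, do the heavy computation.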
> Performance loss with the additional hoodieRecords.isEmpty() in
> HoodieSparkSqlWriter#write
> ------------------------------------------------------------------------------------------
>
> Key: HUDI-1425
> URL: https://issues.apache.org/jira/browse/HUDI-1425
> Project: Apache Hudi
> Issue Type: Improvement
> Components: Spark Integration
> Reporter: pengzhiwei
> Assignee: pengzhiwei
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.7.0
>
> Attachments: 截屏2020-11-30 下午9.47.55.png
>
>
> Currently in HoodieSparkSqlWriter#write there is an _isEmpty()_ test on
> _hoodieRecords_. This can be a heavy operation when _hoodieRecords_ is the
> result of a complex chain of RDD operations.
> !截屏2020-11-30 下午9.47.55.png|width=1255,height=161!
> IMO this test contributes nothing to performance; on the contrary, it
> degrades it.
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)