[
https://issues.apache.org/jira/browse/HUDI-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17245964#comment-17245964
]
Christopher Dedels edited comment on HUDI-1425 at 12/9/20, 1:30 PM:
--------------------------------------------------------------------
+1 for this issue. In addition to isEmpty adding unnecessary time to our Spark
applications, we've discovered that it can actually cause the application to
slow down. For applications that have a very complex transformation prior to
the merge, it can take some time to execute this test.
While caching the dataframe is a potential workaround for this issue, it does
have the unfortunate side effect that the implementer needs to know Hudi's
implementation details in order to add the cache and optimize their graph.
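To illustrate the point about caching, here is a minimal sketch in plain Python. It is a toy model of lazy evaluation, not Spark or Hudi code: LazyDataset, expensive_transformation and merge are hypothetical names invented for this example. It deliberately simplifies (the toy emptiness check materializes every row), but it shows the general pattern: an emptiness check followed by a write evaluates the upstream transformation twice unless the result is cached first.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (not a Spark or Hudi class)."""

    def __init__(self, compute):
        self._compute = compute   # expensive transformation, run on demand
        self._use_cache = False   # set by cache()
        self._cached = None       # materialized rows once cached
        self.evaluations = 0      # how many times the transformation ran

    def cache(self):
        # Ask for the rows to be kept after the first evaluation.
        self._use_cache = True
        return self

    def _rows(self):
        if self._use_cache and self._cached is not None:
            return self._cached
        self.evaluations += 1
        rows = list(self._compute())
        if self._use_cache:
            self._cached = rows
        return rows

    def is_empty(self):
        return len(self._rows()) == 0

    def write(self, sink):
        sink.extend(self._rows())


def expensive_transformation():
    # Placeholder for a complex chain of transformations before the merge.
    return (x * x for x in range(5))


def merge(dataset, sink):
    # The pattern discussed in this issue: test for emptiness, then write.
    if dataset.is_empty():
        return
    dataset.write(sink)


uncached = LazyDataset(expensive_transformation)
uncached_sink = []
merge(uncached, uncached_sink)
print(uncached.evaluations)  # 2: once for is_empty, once for write

cached = LazyDataset(expensive_transformation).cache()
cached_sink = []
merge(cached, cached_sink)
print(cached.evaluations)  # 1: write reuses the rows computed for is_empty
```

Note that the caller has to know an emptiness check happens inside merge in order to decide that cache() is worth adding, which is the side effect described above.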
was (Author: bgt-cdedels):
+1 for this issue. In addition to isEmpty adding unnecessary time to our spark
applications, we've discovered that it can actually cause the application to
crash. This is because the isEmpty runs in a single task. For applications
that have a very complex transformation followed by a merge, it can very easily
overwhelm the executor and cause OOM and/or disk space issues.
We are attempting to cache the dataframe prior to merge as a workaround. It is
our hope that this allows the entire cluster to compute the cached dataframe,
reducing the effort for the isEmpty test.
> Performance loss with the additional hoodieRecords.isEmpty() in
> HoodieSparkSqlWriter#write
> ------------------------------------------------------------------------------------------
>
> Key: HUDI-1425
> URL: https://issues.apache.org/jira/browse/HUDI-1425
> Project: Apache Hudi
> Issue Type: Improvement
> Components: Spark Integration
> Reporter: pengzhiwei
> Assignee: pengzhiwei
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.7.0
>
> Attachments: 截屏2020-11-30 下午9.47.55.png
>
>
> Currently in HoodieSparkSqlWriter#write, there is an _isEmpty()_ test for
> _hoodieRecords_. This may be a heavy operation when
> _hoodieRecords_ involves complex RDD operations.
> !截屏2020-11-30 下午9.47.55.png|width=1255,height=161!
> IMO this test does nothing to improve performance, but rather
> degrades it.
>