[
https://issues.apache.org/jira/browse/HUDI-656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vinoth Chandar updated HUDI-656:
--------------------------------
Fix Version/s: 0.6.0
> Write Performance - Driver spends too much time creating Parquet DataSource
> after writes
> ----------------------------------------------------------------------------------------
>
> Key: HUDI-656
> URL: https://issues.apache.org/jira/browse/HUDI-656
> Project: Apache Hudi (incubating)
> Issue Type: Improvement
> Components: Performance, Spark Integration
> Reporter: Udit Mehrotra
> Assignee: Udit Mehrotra
> Priority: Major
> Fix For: 0.6.0
>
>
> h2. Problem Statement
> We have noticed this performance bottleneck on EMR, and it has also been
> reported here: [https://github.com/apache/incubator-hudi/issues/1371]
> For writes through the DataSource API, Hudi uses
> [this|https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala#L85]
> to create the Spark relation. It first uses HoodieSparkSqlWriter to write the
> dataframe, and afterwards tries to
> [return|https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala#L92]
> a relation by creating it through the parquet data source
> [here|https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala#L72]
> In the process of creating this parquet data source, Spark creates an
> *InMemoryFileIndex*
> [here|https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L371],
> which lists the files under the base path. While the listing itself is
> [parallelized|https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L289],
> the filter we pass, *HoodieROTablePathFilter*, is applied
> [sequentially|https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L294]
> on the driver to every one of the thousands of files returned by the listing.
> This step is not parallelized by Spark, and it takes a long time, most likely
> because of the filter's logic. As a result, the driver spends all of its time
> filtering. We have seen this take 10-12 minutes for just 50 partitions on S3,
> and all of that time is spent after the write has already finished.
> Solving this will significantly reduce the write time across all kinds of
> writes. This time is essentially wasted, because we do not actually have to
> return a relation after the write. The relation is never used by Spark anyway
> [here|https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SaveIntoDataSourceCommand.scala#L45],
> and the write path returns an empty set of rows.
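The driver-side cost described above can be sketched in isolation. A minimal, dependency-free Scala sketch, where `accept` is a hypothetical stand-in for `HoodieROTablePathFilter` (the real filter may touch the filesystem and the Hoodie timeline per path, which is what makes a sequential pass over thousands of paths expensive):

```scala
object DriverFilterSketch {
  // Hypothetical stand-in for HoodieROTablePathFilter.accept(): the real
  // filter can be far more expensive per call than this toy predicate.
  def accept(path: String): Boolean = path.endsWith(".parquet")

  def main(args: Array[String]): Unit = {
    // Paths as they would come back from the (parallelized) listing step.
    val listed = (1 to 5000).map(i => s"s3://bucket/table/part=${i % 50}/f$i.parquet")

    // What InMemoryFileIndex does today: a single sequential pass on the
    // driver, applying the filter to every listed path, one at a time.
    val kept = listed.filter(accept)
    println(kept.size)
  }
}
```

The point of the sketch is the shape of the loop, not the predicate: with a costly `accept`, the total driver time grows linearly with the number of listed files, regardless of cluster size.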
> h2. Proposed Solution
> The proposal is to return an empty Spark relation after the write, which
> cuts out all the unnecessary time spent creating a parquet relation that
> never gets used.
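One possible shape for such an empty relation against the Spark 2.4 sources API (class name hypothetical, and only a sketch of the idea, not the actual patch):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, TableScan}
import org.apache.spark.sql.types.StructType

// Hypothetical relation returned after a successful write, so the driver
// never builds an InMemoryFileIndex (and never runs the path filter) for a
// relation that SaveIntoDataSourceCommand discards anyway.
class HoodieEmptyRelation(val sqlContext: SQLContext) extends BaseRelation with TableScan {
  override def schema: StructType = StructType(Nil)
  override def buildScan(): RDD[Row] = sqlContext.sparkContext.emptyRDD[Row]
}
```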
>
>
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)