Udit Mehrotra created HUDI-656:
----------------------------------
Summary: Write Performance - Driver spends too much time creating
Parquet DataSource after writes
Key: HUDI-656
URL: https://issues.apache.org/jira/browse/HUDI-656
Project: Apache Hudi (incubating)
Issue Type: Improvement
Components: Performance, Spark Integration
Reporter: Udit Mehrotra
h2. Problem Statement
We noticed this performance bottleneck at EMR, and it has also been reported
here: [https://github.com/apache/incubator-hudi/issues/1371]
For writes through the DataSource API, Hudi uses
[this|https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala#L85]
to create the Spark relation. It first uses HoodieSparkSqlWriter to write the
dataframe, and afterwards tries to
[return|https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala#L92]
a relation by creating it through the parquet data source
[here|https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala#L72].
While creating this parquet data source, Spark builds an
*InMemoryFileIndex*
[here|https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L371],
which lists the files under the base path. Although the listing itself is
[parallelized|https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L289],
the filter we pass, *HoodieROTablePathFilter*, is applied
[sequentially|https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L294]
on the driver to each of the thousands of files returned by the listing. Spark
does not parallelize this step, and it is slow, most likely because of the
filter's logic, so the driver spends its time just filtering. We have seen this
take 10-12 minutes for only 50 partitions on S3, all of it after the write has
finished.
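The bottleneck can be illustrated with a self-contained stand-in (plain Scala, not the actual Spark InMemoryFileIndex code; names here are illustrative only):

```scala
// Stand-in for the pruning step inside InMemoryFileIndex: the listing has
// already completed (possibly in parallel across executors), but the path
// filter is then invoked once per listed file on the single driver thread.
def pruneListing(listedFiles: Seq[String],
                 pathFilter: String => Boolean): Seq[String] =
  listedFiles.filter(pathFilter) // single-threaded; cost = #files x filter cost
```

With an expensive filter like HoodieROTablePathFilter (which inspects Hudi metadata per path) and thousands of files, this single-threaded pass dominates the post-write time on the driver.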
Solving this would significantly reduce write times across all write paths.
The time is essentially wasted, because we do not actually need to return a
populated relation after the write: Spark never uses it
[here|https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SaveIntoDataSourceCommand.scala#L45],
and the write path returns an empty set of rows.
h2. Proposed Solution
The proposal is to return an empty Spark relation after the write, eliminating
the unnecessary time spent creating a parquet relation that is never used.
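As a rough sketch of the idea (stand-in types only; the real change would implement Spark's BaseRelation/TableScan interfaces, and every name below is hypothetical, not the actual Hudi fix):

```scala
// Hypothetical stand-in for a Spark relation; the real fix would subclass
// org.apache.spark.sql.sources.BaseRelation instead.
trait StubRelation {
  def schema: Seq[String]   // column names, simplified
  def scan(): Seq[Seq[Any]] // rows, simplified
}

// Proposed shape: after HoodieSparkSqlWriter finishes, return a relation
// that carries the written schema but produces no rows, so no file listing
// or path filtering ever runs. Spark discards the relation after
// SaveIntoDataSourceCommand completes anyway.
def emptyRelation(columns: Seq[String]): StubRelation = new StubRelation {
  def schema: Seq[String] = columns
  def scan(): Seq[Seq[Any]] = Seq.empty
}
```

Because the relation is never consumed, preserving only the schema keeps the DataSource contract intact while skipping the InMemoryFileIndex construction entirely.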
--
This message was sent by Atlassian Jira
(v8.3.4#803005)