Udit Mehrotra created HUDI-656:
----------------------------------

             Summary: Write Performance - Driver spends too much time creating 
Parquet DataSource after writes
                 Key: HUDI-656
                 URL: https://issues.apache.org/jira/browse/HUDI-656
             Project: Apache Hudi (incubating)
          Issue Type: Improvement
          Components: Performance, Spark Integration
            Reporter: Udit Mehrotra


h2. Problem Statement

We have noticed this performance bottleneck on EMR, and it has also been 
reported here: [https://github.com/apache/incubator-hudi/issues/1371]

When writing through the DataSource API, Hudi uses 
[this|https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala#L85]
 to create the Spark relation. It first writes the dataframe via 
HoodieSparkSqlWriter, and afterwards attempts to 
[return|https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala#L92]
 a relation by creating it through the parquet data source 
[here|https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala#L72].

In the process of creating this parquet data source, Spark creates an 
*InMemoryFileIndex* 
[here|https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L371]
 and, as part of that, lists the files under the base path. While the listing 
itself is 
[parallelized|https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L289],
 the filter we pass, *HoodieROTablePathFilter*, is applied 
[sequentially|https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L294]
 on the driver to every one of the thousands of files returned by the listing. 
Spark does not parallelize this step, and the filter's per-file logic makes it 
expensive, so the driver ends up spending its time just filtering. We have 
seen this take 10-12 minutes for only 50 partitions on S3, all of it after the 
write itself has finished.
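The driver-side cost can be illustrated in miniature. The sketch below is not Hudi or Spark code: `slowAccept` is a hypothetical stand-in for `HoodieROTablePathFilter.accept()` (with an invented 1 ms delay standing in for per-file metadata lookups), and the chunked-future version merely shows how the same filter could run concurrently instead of in one sequential pass on a single thread.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Stand-in for HoodieROTablePathFilter.accept(): expensive per file because it
// consults table metadata to decide whether the file belongs to the latest
// commit. The name and the 1 ms delay are invented for illustration.
def slowAccept(path: String): Boolean = {
  Thread.sleep(1)
  !path.contains(".hoodie")
}

val listedFiles: Seq[String] =
  (1 to 200).map(i => s"s3://bucket/table/partition=$i/data.parquet")

// What the driver effectively does today: one sequential pass over every file.
val t0 = System.nanoTime()
val sequential = listedFiles.filter(slowAccept)
println(s"sequential: ${(System.nanoTime() - t0) / 1000000} ms")

// The same filter applied concurrently in chunks, using only the standard
// library's Futures.
val t1 = System.nanoTime()
val chunks = listedFiles.grouped(25).toSeq.map(c => Future(c.filter(slowAccept)))
val parallel = Await.result(Future.sequence(chunks), 1.minute).flatten
println(s"chunked: ${(System.nanoTime() - t1) / 1000000} ms")
```

Both passes keep every file (none of the sketch paths match `.hoodie`); the point is only where the per-file work runs, not what it returns.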

Solving this will significantly reduce write time across all write paths. The 
time is essentially wasted, because we do not actually need to return a usable 
relation after a write: Spark never uses the returned relation 
[here|https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SaveIntoDataSourceCommand.scala#L45]
 and the write simply returns an empty set of rows.
h2. Proposed Solution

The proposal is to return an empty Spark relation after the write, cutting out 
all the unnecessary time spent creating a parquet relation that never gets 
used.
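A minimal sketch of the idea, using stand-in types instead of Spark's actual classes so the snippet is self-contained; the class name `HoodieEmptyRelation` is invented here, and in a real patch the class would extend `org.apache.spark.sql.sources.BaseRelation` and be returned from `DefaultSource.createRelation` in place of the parquet relation.

```scala
// Stand-ins for the Spark types involved; a real implementation would use
// org.apache.spark.sql.types.StructType and org.apache.spark.sql.sources.BaseRelation.
final case class StructType(fieldNames: Seq[String])
trait BaseRelation { def schema: StructType }

// Hypothetical empty relation: it carries the schema of the dataframe that was
// just written but backs no data, so returning it triggers no file listing and
// no HoodieROTablePathFilter evaluation on the driver.
final class HoodieEmptyRelation(writtenSchema: StructType) extends BaseRelation {
  override def schema: StructType = writtenSchema
}

// The write path of createRelation(...) would end with something like
//   new HoodieEmptyRelation(df.schema)
// instead of building a parquet data source over the freshly written files.
val relation = new HoodieEmptyRelation(
  StructType(Seq("_hoodie_commit_time", "id", "value")))
println(relation.schema.fieldNames.mkString(",")) // prints _hoodie_commit_time,id,value
```

Since SaveIntoDataSourceCommand discards the relation anyway, the only observable difference is the time saved.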

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
