[jira] [Created] (HUDI-9674) Audit performance of Spark Reads in Hudi 1.X

Timothy Brown (Jira) Thu, 31 Jul 2025 06:43:04 -0700

Timothy Brown created HUDI-9674:
-----------------------------------

             Summary: Audit performance of Spark Reads in Hudi 1.X
                 Key: HUDI-9674
                 URL: https://issues.apache.org/jira/browse/HUDI-9674
             Project: Apache Hudi
          Issue Type: Improvement
            Reporter: Timothy Brown



We have noted a few opportunities to improve Spark read performance while
working on other features.

1. `isSplitable` in the FileFormat interface should be based on whether the
path is a base file with no log files. This will allow us to parallelize reads
of large files, even in some cases of MoR real-time queries. For MoR
read-optimized queries this should always apply and give users better
parallelization.

2. Returning batches in Spark should similarly be controlled by whether only
base files are read. This means MoR read-optimized queries can use this
feature.

3. Explore vectorized read support for FileGroupReader. The file group reader
path currently always disables the vectorized reader manually, so we should
explore what it would take to support it.

4. Explore columnar batch support for FileGroupReader. Can we convert our
iterator of rows into a columnar batch? This would allow us to read more
efficiently for tables where only a handful of file groups have log files.
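
The check behind items 1 and 2 could be sketched roughly as follows. This is
only an illustration under stated assumptions, not the actual Hudi code:
`BaseFileOnlyGate`, `FileSlice`, and `isBaseFileOnly` are hypothetical
stand-ins, and the two static methods only mirror the names of Spark's
`isSplitable` and `supportBatch` hooks with simplified signatures.

```java
import java.util.List;

// Hypothetical sketch: decide splitting and batch support from whether
// a read touches only base files. None of these types are Hudi classes.
final class BaseFileOnlyGate {

    // Simplified stand-in for a file slice: an optional base file plus
    // zero or more log files that would need to be merged at read time.
    static final class FileSlice {
        final String baseFilePath;        // null when the slice has only log files
        final List<String> logFilePaths;

        FileSlice(String baseFilePath, List<String> logFilePaths) {
            this.baseFilePath = baseFilePath;
            this.logFilePaths = logFilePaths;
        }

        // True when there is a base file and nothing to merge on top of it.
        boolean isBaseFileOnly() {
            return baseFilePath != null && logFilePaths.isEmpty();
        }
    }

    // Item 1: a file is safe to split only if no log files must be merged.
    static boolean isSplitable(FileSlice slice) {
        return slice.isBaseFileOnly();
    }

    // Item 2: return batches only if every slice in the read is base-file-only.
    static boolean supportBatch(List<FileSlice> slices) {
        for (FileSlice slice : slices) {
            if (!slice.isBaseFileOnly()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        FileSlice baseOnly = new FileSlice("f1.parquet", List.of());
        FileSlice withLogs = new FileSlice("f2.parquet", List.of(".f2.log.1"));

        System.out.println(isSplitable(baseOnly));
        System.out.println(isSplitable(withLogs));
        System.out.println(supportBatch(List.of(baseOnly, withLogs)));
    }
}
```

In a read-optimized query every slice would look like `baseOnly` above, which
is why both checks should always pass there. A real-time query would pass only
for the file groups that currently have no log files.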



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
