[I] Audit performance of Spark Reads in Hudi 1.X [hudi]

via GitHub Sun, 30 Nov 2025 04:52:09 -0800


hudi-bot opened a new issue, #17128:
URL: https://github.com/apache/hudi/issues/17128


   There are a few opportunities for better performance that we have noted 
while working through other features.
   
   1. `isSplitable` in the FileFormat interface should depend on whether the 
path refers to a base file with no log files to merge. This will allow us to 
parallelize reads of large files, even in some MoR real-time queries. For MoR 
read-optimized queries this should always apply and give users better 
parallelization.
   
   2. Returning batches in Spark should similarly be controlled by whether 
only base files are read. This means MoR read-optimized queries can use 
this feature.
   
   3. Explore vectorized read support for FileGroupReader. Currently the file 
group reader path always disables the vectorized reader manually, so we 
should explore what it would take to support it.
   
   4. Explore columnar batch support for FileGroupReader. Can we convert our 
iterator of rows into a columnar batch? This would allow us to read more 
efficiently for tables where only a handful of file groups have log files.
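
As a rough illustration of items 1 and 2, the sketch below models the decision as a check on whether a file slice consists of a base file alone. It is not taken from the Hudi or Spark code base: `ReadPlanSketch`, `FileSlice`, `isBaseFileOnly`, and `supportsBatch` are hypothetical names, and only `isSplitable` mirrors the method named above. It assumes a read-optimized query presents each slice without its log files.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: none of these types are Hudi or Spark APIs.
public class ReadPlanSketch {

    // Simplified stand-in for a file slice: one optional base file plus log files.
    static final class FileSlice {
        final String baseFilePath;       // null when the slice has no base file
        final List<String> logFilePaths; // empty when there is nothing to merge

        FileSlice(String baseFilePath, List<String> logFilePaths) {
            this.baseFilePath = baseFilePath;
            this.logFilePaths = logFilePaths;
        }

        boolean isBaseFileOnly() {
            return baseFilePath != null && logFilePaths.isEmpty();
        }
    }

    // Item 1: allow splitting only when no log files have to be merged,
    // since a split of a base file can then be read independently.
    static boolean isSplitable(FileSlice slice) {
        return slice.isBaseFileOnly();
    }

    // Item 2: return batches only when every slice in the scan is base-file-only.
    static boolean supportsBatch(List<FileSlice> slices) {
        return slices.stream().allMatch(FileSlice::isBaseFileOnly);
    }

    public static void main(String[] args) {
        FileSlice baseOnly = new FileSlice("f1.parquet", Collections.<String>emptyList());
        FileSlice withLogs = new FileSlice("f2.parquet", Arrays.asList("f2.log.1"));

        System.out.println(isSplitable(baseOnly));
        System.out.println(isSplitable(withLogs));
        System.out.println(supportsBatch(Arrays.asList(baseOnly, withLogs)));
    }
}
```

In this sketch the batch decision is made for the whole scan, while the split decision is made per file; whether the real reader can decide batching at a finer granularity is an open question for item 4.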
   
   ## JIRA info
   
   - Link: https://issues.apache.org/jira/browse/HUDI-9674
   - Type: Improvement


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
