hudi-bot opened a new issue, #17128: URL: https://github.com/apache/hudi/issues/17128
There are a few opportunities for better performance that we have noted while working through other features.

1. We should make `isSplitable` in the FileFormat interface depend on whether the path refers only to a base file. This will allow us to parallelize reads of large files, even in some cases of MoR real-time queries. For MoR read-optimized queries, this should always kick in and give users better parallelization.
2. Returning batches in Spark should similarly be controlled by whether only base files are read. This means MoR read-optimized queries can use this feature.
3. Explore vectorized read support for FileGroupReader. Currently, the file group reader path always manually disables the vectorized reader, so we should explore what it will take to get this supported.
4. Explore columnar batch support for FileGroupReader. Can we convert our iterator of rows into a columnar batch? This would allow us to read more optimally for tables where only a handful of file groups have log files.

## JIRA info

- Link: https://issues.apache.org/jira/browse/HUDI-9674
- Type: Improvement
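
The rule proposed in items 1 and 2 could be sketched roughly as follows. This is only an illustration: `FileSliceInfo`, `isBaseFileOnly`, and `canReturnBatches` are hypothetical names invented for this sketch, not existing Hudi or Spark classes or methods, and the real change would live in the FileFormat implementation.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: none of these types are Hudi or Spark APIs.
final class BaseFileOnlySketch {

    // Stand-in for a file slice: one base file plus zero or more log files.
    static final class FileSliceInfo {
        final String baseFilePath;        // null for a log-only slice
        final List<String> logFilePaths;  // empty when there is nothing to merge

        FileSliceInfo(String baseFilePath, List<String> logFilePaths) {
            this.baseFilePath = baseFilePath;
            this.logFilePaths =
                logFilePaths == null ? Collections.<String>emptyList() : logFilePaths;
        }

        // True when the read needs no log merging at all.
        boolean isBaseFileOnly() {
            return baseFilePath != null && logFilePaths.isEmpty();
        }
    }

    // Item 1: a file may be split only when it is read as a plain base file.
    static boolean isSplitable(FileSliceInfo slice) {
        return slice.isBaseFileOnly();
    }

    // Item 2: batches may be returned only when every slice in the scan is base-file-only.
    // An empty scan trivially satisfies the condition.
    static boolean canReturnBatches(List<FileSliceInfo> slices) {
        for (FileSliceInfo slice : slices) {
            if (!slice.isBaseFileOnly()) {
                return false;
            }
        }
        return true;
    }
}
```

Under this sketch, a MoR read-optimized query, which only ever sees base files, would always pass both checks, while a real-time query would pass them only for file groups that currently have no log files.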
