Timothy Brown created HUDI-9674:
-----------------------------------
Summary: Audit performance of Spark Reads in Hudi 1.X
Key: HUDI-9674
URL: https://issues.apache.org/jira/browse/HUDI-9674
Project: Apache Hudi
Issue Type: Improvement
Reporter: Timothy Brown
There are a few opportunities for better performance that we have noted while
working through other features.
1. We should make `isSplitable` in the FileFormat interface depend on whether
the path refers to a base file only. This will allow us to parallelize reads of
large files, even in some cases of MoR real-time queries. For MoR read-optimized
queries this should always apply and give users better parallelization.
2. Returning batches in Spark should similarly be controlled by whether only
base files are read. This means MoR read-optimized queries can use this
feature.
3. Explore vectorized read support for FileGroupReader. Currently the file
group reader path always disables the vectorized reader manually, so we
should explore what it would take to support it.
4. Explore columnar batch support for FileGroupReader. Can we convert our
iterator of rows into a columnar batch? This would allow us to read more
efficiently for tables where only a handful of file groups have log files.
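
A minimal sketch of the decision behind items 1 and 2, assuming a simplified model of a file slice. `ReadPlanSketch` and `FileSlice` are hypothetical names introduced only for illustration and are not Hudi classes. Spark's real `FileFormat.isSplitable` and `supportBatch` take a session plus a path or schema, which this sketch does not model.

```java
import java.util.List;

// Hypothetical sketch: not Hudi or Spark code, only the base-file-only check.
final class ReadPlanSketch {

    // Simplified stand-in for a file slice: one optional base file plus log files.
    static final class FileSlice {
        final String baseFilePath;        // null when the slice has no base file
        final List<String> logFilePaths;  // empty when nothing needs merging

        FileSlice(String baseFilePath, List<String> logFilePaths) {
            this.baseFilePath = baseFilePath;
            this.logFilePaths = logFilePaths;
        }

        boolean isBaseFileOnly() {
            return baseFilePath != null && logFilePaths.isEmpty();
        }
    }

    // Item 1: allow splitting only when no log files have to be merged.
    static boolean isSplitable(FileSlice slice) {
        return slice.isBaseFileOnly();
    }

    // Item 2: return batches only when every slice being read is base-file only.
    // An empty list trivially qualifies because allMatch is vacuously true.
    static boolean supportBatch(List<FileSlice> slices) {
        return slices.stream().allMatch(FileSlice::isBaseFileOnly);
    }
}
```

Under this model, a MoR read-optimized query only ever sees base files, so both checks pass for every slice, while a real-time query falls back to the row path as soon as one slice carries log files.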