Abstract
In the current implementation, the HUDI Writer Client (in the write path) and
HUDI queries (through the InputFormat in the read path) have to perform a
"list files" operation on the file system to get the current view of the
file system. In HDFS, listing all the files in a dataset is a
NameNode-intensive operation for large datasets. For example, one of our
HUDI datasets has thousands of date partitions, with each partition
containing thousands of data files.
With this effort, we want to:
1. Eliminate the requirement of the "list files" operation
   1. This will be done by proactively maintaining metadata about the
      list of files
   2. Reading the file list from a single file should be faster than a
      large number of NameNode operations
2. Create Column Indexes for better query planning and faster lookups by
   Readers
   1. For a column in the dataset, the min/max range per Parquet file can
      be maintained.
   2. Just by reading this index file, the query planning system should
      be able to determine the set of potentially matching Parquet files
      for a range query.
   3. Reading column information from an index file should be faster
      than reading the individual Parquet footers.
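To make the two goals above concrete, here is a minimal sketch in Python of how a consolidated metadata file could serve both the file listing and range-based file pruning. The JSON layout, file paths, and function names are hypothetical illustrations, not the actual HUDI design:

```python
import json

# Hypothetical metadata file contents: one entry per data file, with
# per-column min/max statistics. The real on-disk format would differ;
# this only illustrates the idea of answering "list files" and range
# pruning from a single metadata read instead of many NameNode calls.
metadata = json.loads("""
{
  "files": [
    {"path": "2019/10/01/part-0001.parquet",
     "columns": {"ride_distance": {"min": 0.0, "max": 12.5}}},
    {"path": "2019/10/01/part-0002.parquet",
     "columns": {"ride_distance": {"min": 40.0, "max": 95.0}}}
  ]
}
""")

def list_files(meta):
    """Serve the file listing from the metadata, with no NameNode calls."""
    return [f["path"] for f in meta["files"]]

def prune_by_range(meta, column, lo, hi):
    """Keep only files whose [min, max] for `column` overlaps [lo, hi]."""
    kept = []
    for f in meta["files"]:
        stats = f["columns"].get(column)
        # Without stats for the column we must keep the file (no pruning).
        if stats is None or (stats["max"] >= lo and stats["min"] <= hi):
            kept.append(f["path"])
    return kept

print(list_files(metadata))
print(prune_by_range(metadata, "ride_distance", 0.0, 10.0))
```

A query planner with a predicate like `ride_distance < 10.0` would only need to open the first file, without touching any Parquet footers.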
This should provide the following benefits:
1. Reducing the number of file listing operations improves NameNode
   scalability by reducing its load.
2. Query planning is optimized, since it is done by reading a single
   metadata file, and its cost stays mostly bounded regardless of dataset
   size.
3. Partition-path-agnostic queries can be performed efficiently.
We seek the Hudi development community's input on this proposal, to
explore it further and to implement a solution that benefits the Hudi
community and meets the various use cases/requirements.
https://cwiki.apache.org/confluence/display/HUDI/RFC+-+15%3A+HUDI+File+Listing+and+Query+Planning+Improvements
Thanks,
Balajee, Prashant and Nishith