Abstract
In the current implementation, the HUDI Writer Client (in the write path) and
HUDI queries (through the InputFormat in the read path) have to perform a
"list files" operation on the file system to get the current view of the
file system. In HDFS, listing all the files in a dataset is a
NameNode-intensive operation for large datasets. For example, one of our
HUDI datasets has thousands of date partitions, with each partition
containing thousands of data files.
With this effort, we want to:
1. Eliminate the requirement of the "list files" operation
   1. This will be done by proactively maintaining metadata about the
      list of files
   2. Reading the file list from a single file should be faster than a
      large number of NameNode operations
2. Create Column Indexes for better query planning and faster lookups by
   Readers
   1. For a column in the dataset, the min/max range per Parquet file can
      be maintained.
   2. Just by reading this index file, the query planning system should
      be able to determine the set of potentially matching Parquet files
      for a range query.
   3. Reading column information from an index file should be faster
      than reading the individual Parquet footers.
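To make the two goals above concrete, here is a minimal sketch in Python of how a consolidated metadata file could serve both the file listing and range-based file pruning. The JSON layout, file paths, and function names are hypothetical illustrations, not the actual HUDI design:

```python
import json

# Hypothetical metadata file contents: one entry per data file, with
# per-column min/max statistics. The real on-disk format would differ;
# this only illustrates the idea of answering "list files" and range
# pruning from a single metadata read instead of many NameNode calls.
metadata = json.loads("""
{
  "files": [
    {"path": "2019/10/01/part-0001.parquet",
     "columns": {"ride_distance": {"min": 0.0, "max": 12.5}}},
    {"path": "2019/10/01/part-0002.parquet",
     "columns": {"ride_distance": {"min": 40.0, "max": 95.0}}}
  ]
}
""")

def list_files(meta):
    """Serve the file listing from the metadata, with no NameNode calls."""
    return [f["path"] for f in meta["files"]]

def prune_by_range(meta, column, lo, hi):
    """Keep only files whose [min, max] for `column` overlaps [lo, hi]."""
    kept = []
    for f in meta["files"]:
        stats = f["columns"].get(column)
        # Without stats for the column we must keep the file (no pruning).
        if stats is None or (stats["max"] >= lo and stats["min"] <= hi):
            kept.append(f["path"])
    return kept

print(list_files(metadata))
print(prune_by_range(metadata, "ride_distance", 0.0, 10.0))
```

A query planner with a predicate like `ride_distance < 10.0` would only need to open the first file, without touching any Parquet footers.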
This should provide the following benefits:
1. Reducing the number of file listing operations improves NameNode
   scalability by reducing its load.
2. Query planning is optimized, since it is done by reading a single
   metadata file, and its cost stays mostly bounded regardless of dataset
   size.
3. Partition-path-agnostic queries can be performed efficiently.
We seek the Hudi development community's input on this proposal, to
explore it further and to implement a solution that benefits the Hudi
community and meets the various use cases/requirements.
https://cwiki.apache.org/confluence/display/HUDI/RFC+-+15%3A+HUDI+File+Listing+and+Query+Planning+Improvements
Thanks,
Balajee, Prashant and Nishith