Big +1 on the requirement. This would also help datasets using cloud storage by 
avoiding costly listings there.  I will look closely at the design and 
implementation in the RFC and comment.
Balaji.V

On Monday, February 17, 2020, 02:06:59 PM PST, Balajee 
Nagasubramaniam <[email protected]> wrote:  
 
 Abstract

In the current implementation, the HUDI Writer Client (in the write path) and
HUDI queries (through the InputFormat in the read path) have to perform a “list
files” operation on the file system to get the current view of the file
system.  In HDFS, listing all the files in a dataset is a NameNode-intensive
operation for large datasets. For example, one of our HUDI
datasets has thousands of date partitions, with each partition having
thousands of data files.

With this effort, we want to:

  1. Eliminate the need for the “list files” operation
      1. This will be done by proactively maintaining metadata about the
      list of files
      2. Reading the file list from a single file should be faster than
      a large number of NameNode operations
  2. Create Column Indexes for better query planning and faster lookups by
  Readers
      1. For a column in the dataset, min/max range per Parquet file can be
      maintained.
      2. Just by reading this index file, the query planning system should
      be able to get the view of potential Parquet files for a range query.
      3. Reading Column information from an index file should be faster
      than reading the individual Parquet Footers.
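
As a rough illustration of the column index idea above, here is a minimal
Python sketch of pruning data files for a range query using per-file min/max
values. The in-memory layout, the `ColumnRange` type, the `candidate_files`
helper and the file paths are all hypothetical and chosen only for
illustration; they are not the format or API proposed in the RFC.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ColumnRange:
    """Min/max of one column within one data file (hypothetical layout)."""
    min_value: int
    max_value: int


# Hypothetical in-memory form of a single index file:
# data file path -> value range of the indexed column in that file.
ColumnIndex = Dict[str, ColumnRange]


def candidate_files(index: ColumnIndex, low: int, high: int) -> List[str]:
    """Return files whose [min, max] range overlaps the query range [low, high]."""
    return sorted(
        path
        for path, rng in index.items()
        if rng.min_value <= high and rng.max_value >= low
    )


# Made-up example data; a real index would be loaded from the metadata file.
index: ColumnIndex = {
    "2020/02/15/file-a.parquet": ColumnRange(0, 99),
    "2020/02/16/file-b.parquet": ColumnRange(100, 199),
    "2020/02/17/file-c.parquet": ColumnRange(150, 300),
}

# Range query: 120 <= column <= 160. Only file-b and file-c can contain matches.
print(candidate_files(index, 120, 160))
```

In this sketch the keys of the index also double as the file list, so neither
a file system listing nor a Parquet footer read is needed to plan the query.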

This should provide the following benefits:

  1. Reducing the number of file listing operations improves NameNode
  scalability and reduces NameNode burden.
  2. Query planning is optimized, as it is done by reading a single
  metadata file and its cost is mostly bounded regardless of the size of
  the dataset.
  3. Partition-path-agnostic queries can be performed in a
  performant way.


We seek the Hudi development community's input on this proposal, to explore
this further and to implement a solution that is beneficial to the Hudi
community and meets its various use cases and requirements.

https://cwiki.apache.org/confluence/display/HUDI/RFC+-+15%3A+HUDI+File+Listing+and+Query+Planning+Improvements

Thanks,
Balajee, Prashant and Nishith  
