Re: Discussion Thread: HUDI File Listing and Query Planning Improvements

[email protected] Mon, 17 Feb 2020 21:28:14 -0800

 
Big +1 on the requirement. This would also help datasets on cloud storage by
avoiding costly listings there. I will look closely at the design and
implementation in the RFC and comment.
Balaji.V

On Monday, February 17, 2020, 02:06:59 PM PST, Balajee Nagasubramaniam <[email protected]> wrote:
 
 Abstract

In the current implementation, the HUDI Writer Client (in the write path) and
HUDI queries (through the InputFormat in the read path) have to perform a “list
files” operation on the file system to get the current view of the file
system. In HDFS, listing all the files in a dataset is a NameNode-intensive
operation for large datasets. For example, one of our HUDI
datasets has thousands of date partitions, with each partition having
thousands of data files.

With this effort, we want to:

1. Eliminate the requirement of “list files” operation
   1. This will be done by proactively maintaining metadata about the
      list of files
   2. Reading the file list from a single file should be faster than
      large number of NameNode operations
2. Create Column Indexes for better query planning and faster lookups by
   Readers
   1. For a column in the dataset, min/max range per Parquet file can be
      maintained.
   2. Just by reading this index file, the query planning system should
      be able to get the view of potential Parquet files for a range query.
   3. Reading Column information from an index file should be faster
      than reading the individual Parquet Footers.
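
To make the column index idea in item 2 concrete, here is a minimal sketch of min/max pruning for a range query. It is not taken from the RFC: the `ColumnRange` record, the `candidate_files` helper, and the sample file paths are hypothetical names chosen for illustration, and the index is held in memory rather than read from an index file.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ColumnRange:
    """Hypothetical index entry: min/max of one column in one Parquet file."""
    file_path: str
    min_value: int
    max_value: int


def candidate_files(index: List[ColumnRange], lo: int, hi: int) -> List[str]:
    """Return files whose [min, max] range overlaps the query range [lo, hi]."""
    return [
        entry.file_path
        for entry in index
        # Two closed intervals overlap when each starts before the other ends.
        if entry.max_value >= lo and entry.min_value <= hi
    ]


# Illustrative index; the paths and values are made up.
index = [
    ColumnRange("2020/02/15/a.parquet", 0, 99),
    ColumnRange("2020/02/16/b.parquet", 100, 199),
    ColumnRange("2020/02/17/c.parquet", 150, 300),
]

# Range query: 120 <= col <= 160. Only b and c can contain matching rows.
print(candidate_files(index, 120, 160))
```

The planner only touches the index entries, so no Parquet footer needs to be opened to rule out `a.parquet`.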

This should provide the following benefits:

1. Reducing the number of file listing operations improves NameNode
scalability and reduces the load on the NameNode.
2. Query planning is faster because it reads a single metadata file, so its
cost is mostly bounded regardless of the size of the dataset.
3. Partition-path-agnostic queries can be performed in a
performant way.
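
As a rough illustration of the first two benefits, the sketch below keeps the file list in a single metadata file that the writer updates on each commit and the reader loads in one read, instead of listing every partition. The JSON layout and the `record_commit` and `list_files` helpers are assumptions made for this example; the actual format and APIs are for the RFC to define.

```python
import json
import os
import tempfile
from typing import Dict, List, Optional


def record_commit(manifest_path: str, partition: str, added_files: List[str]) -> None:
    """Writer side (hypothetical): merge newly written files into the manifest."""
    manifest: Dict[str, List[str]] = {}
    if os.path.exists(manifest_path):
        with open(manifest_path) as f:
            manifest = json.load(f)
    files = set(manifest.get(partition, []))
    files.update(added_files)
    manifest[partition] = sorted(files)
    # Write to a temporary file, then rename, so readers never see a partial file.
    tmp_path = manifest_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(manifest, f)
    os.replace(tmp_path, manifest_path)


def list_files(manifest_path: str, partition: Optional[str] = None) -> List[str]:
    """Reader side (hypothetical): one file read replaces per-partition listings."""
    with open(manifest_path) as f:
        manifest = json.load(f)
    partitions = [partition] if partition is not None else sorted(manifest)
    return [f"{p}/{name}" for p in partitions for name in manifest.get(p, [])]


with tempfile.TemporaryDirectory() as base:
    manifest_path = os.path.join(base, "file_listing.json")
    record_commit(manifest_path, "2020/02/16", ["f0.parquet"])
    record_commit(manifest_path, "2020/02/17", ["f1.parquet", "f2.parquet"])
    # A query that ignores partition paths can still see every file in one read.
    print(list_files(manifest_path))
```

Because `list_files` without a partition argument returns every file, the same read also serves queries that do not filter on the partition path.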

We seek the Hudi development community's input on this proposal, to explore
it further and to implement a solution that benefits the Hudi
community and meets its various use cases and requirements.

https://cwiki.apache.org/confluence/display/HUDI/RFC+-+15%3A+HUDI+File+Listing+and+Query+Planning+Improvements

Thanks,
Balajee, Prashant and Nishith
