Big +1 on the requirement. This would also help datasets on cloud storage by avoiding costly listings there. I will look closely at the design and implementation in the RFC and comment.

Balaji.V

On Monday, February 17, 2020, 02:06:59 PM PST, Balajee Nagasubramaniam <[email protected]> wrote:

Abstract
In the current implementation, the HUDI Writer Client (in the write path) and HUDI queries (through the InputFormat in the read path) have to perform a “list files” operation on the file system to get the current view of the file system. In HDFS, listing all the files in a dataset is a NameNode-intensive operation for large datasets. For example, one of our HUDI datasets has thousands of date partitions, with each partition having thousands of data files.

With this effort, we want to:

1. Eliminate the requirement of the “list files” operation
   1. This will be done by proactively maintaining metadata about the list of files.
   2. Reading the file list from a single file should be faster than a large number of NameNode operations.
2. Create Column Indexes for better query planning and faster lookups by Readers
   1. For a column in the dataset, the min/max range per Parquet file can be maintained.
   2. Just by reading this index file, the query planning system should be able to get the view of potential Parquet files for a range query.
   3. Reading column information from an index file should be faster than reading the individual Parquet footers.

This should provide the following benefits:

1. Reducing the number of file listing operations improves NameNode scalability and reduces NameNode burden.
2. Query planning is optimized, as it is done by reading one metadata file and is mostly bounded regardless of the size of the dataset.
3. It can allow partition-path-agnostic queries to be performed efficiently.

We seek the Hudi development community's input on this proposal, to explore this further and to implement a solution that is beneficial to the Hudi community, meeting various use cases/requirements.

https://cwiki.apache.org/confluence/display/HUDI/RFC+-+15%3A+HUDI+File+Listing+and+Query+Planning+Improvements

Thanks,
Balajee, Prashant and Nishith
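
To illustrate the column index idea in goal 2, here is a minimal sketch of how a planner might prune files for a range query using per-file min/max values. It does not show the RFC's design or any Hudi API: the `ColumnRange` type, the `candidate_files` function, the in-memory dictionary standing in for the index file, and the file paths are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ColumnRange:
    """Min/max of one column within one data file (hypothetical index entry)."""
    min_value: int
    max_value: int


def candidate_files(index: Dict[str, ColumnRange], lo: int, hi: int) -> List[str]:
    """Return the files whose [min, max] range overlaps the query range [lo, hi]."""
    return sorted(
        path
        for path, rng in index.items()
        if rng.min_value <= hi and rng.max_value >= lo
    )


# Hypothetical index contents: file path -> range of one numeric column.
# In the proposal this would be read from a single index file rather than
# from each Parquet footer.
index = {
    "2020/02/15/file_a.parquet": ColumnRange(0, 99),
    "2020/02/16/file_b.parquet": ColumnRange(100, 199),
    "2020/02/17/file_c.parquet": ColumnRange(150, 300),
}

# Range query: column BETWEEN 120 AND 160
print(candidate_files(index, 120, 160))
```

With these sample ranges, only `file_b` and `file_c` overlap the query range, so `file_a` is skipped without listing its partition or opening the file.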
