Hi Balajee,

Big +1 for the RFC, good optimization mechanism.
Best,
Vino

[email protected] <[email protected]> wrote on Tuesday, February 18, 2020 at 1:27 PM:

> Big +1 on the requirement. This would also help datasets using cloud
> storage by avoiding costly listings there. Will look closely at the design
> and implementation in the RFC to comment.
>
> Balaji.V
>
> On Monday, February 17, 2020, 02:06:59 PM PST, Balajee
> Nagasubramaniam <[email protected]> wrote:
>
> Abstract
>
> In the current implementation, the HUDI Writer Client (in the write path)
> and HUDI queries (through InputFormat in the read path) have to perform a
> “list files” operation on the file system to get the current view of the
> file system. In HDFS, listing all the files in the dataset is a
> NameNode-intensive operation for large datasets. For example, one of our
> HUDI datasets has thousands of date partitions, with each partition having
> thousands of data files.
>
> With this effort, we want to:
>
> 1. Eliminate the requirement of the “list files” operation
>    1. This will be done by proactively maintaining metadata about the
>       list of files.
>    2. Reading the file list from a single file should be faster than a
>       large number of NameNode operations.
> 2. Create Column Indexes for better query planning and faster lookups by
>    Readers
>    1. For a column in the dataset, the min/max range per Parquet file can
>       be maintained.
>    2. Just by reading this index file, the query planning system should
>       be able to get the view of potential Parquet files for a range query.
>    3. Reading column information from an index file should be faster
>       than reading the individual Parquet footers.
>
> This should provide the following benefits:
>
> 1. Reducing the number of file listing operations improves NameNode
>    scalability and reduces NameNode burden.
> 2. The Query Planner is optimized, as the planning is done by reading 1
>    metadata file and is mostly bounded regardless of the size of the
>    dataset.
> 3. Can allow for performing partition-path-agnostic queries in a
>    performant way.
>
> We seek the Hudi development community's input on this proposal, to
> explore this further and to implement a solution that is beneficial to
> the Hudi community, meeting various use cases/requirements.
>
> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+15%3A+HUDI+File+Listing+and+Query+Planning+Improvements
>
> Thanks,
> Balajee, Prashant and Nishith
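
To make the column index idea in the abstract a bit more concrete, here is a minimal sketch of how a planner might pick candidate Parquet files for a range query from per-file min/max entries. This is only an illustration, not the RFC's design: the `FileStats` record, the `candidate_files` helper, the file paths, and the integer column values are all hypothetical, and the index is held in memory here rather than read from a metadata file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Hypothetical index entry: one data file plus min/max of one column."""
    path: str
    min_value: int
    max_value: int


def candidate_files(index, low, high):
    """Return paths whose [min, max] range overlaps the query range [low, high]."""
    return [
        entry.path
        for entry in index
        if entry.max_value >= low and entry.min_value <= high
    ]


# Hypothetical index contents. In the proposal these entries would come from
# a single metadata file, so no directory listing or footer read is needed.
index = [
    FileStats("2020/02/15/a.parquet", 0, 99),
    FileStats("2020/02/16/b.parquet", 100, 199),
    FileStats("2020/02/17/c.parquet", 150, 300),
]

print(candidate_files(index, 120, 160))
```

With these made-up entries, a query for values between 120 and 160 keeps only the second and third files; the first is skipped without opening it. The same in-memory list also doubles as the file listing, which is the point of item 1 in the abstract.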
