+1 on this as well. Also happy to collaborate on the RFC itself and help it make progress.
>> For a column in the dataset, min/max range per Parquet file can be
>> maintained

This (as Nishith probably mentioned) will also help speed up the current
bloom index's range checking.

On Tue, Feb 18, 2020 at 5:40 AM vino yang <[email protected]> wrote:

> Hi Balajee,
>
> Big +1 for the RFC, good optimization mechanism.
>
> Best,
> Vino
>
> [email protected] <[email protected]> wrote on Tue, Feb 18, 2020 at 1:27 PM:
>
> > Big +1 on the requirement. This would also help datasets using cloud
> > storage by avoiding costly listings there. Will look closely at the
> > design and implementation in the RFC to comment.
> >
> > Balaji.V
> >
> > On Monday, February 17, 2020, 02:06:59 PM PST, Balajee
> > Nagasubramaniam <[email protected]> wrote:
> >
> > Abstract
> >
> > In the current implementation, the HUDI Writer Client (in the write
> > path) and HUDI queries (through InputFormat in the read path) have to
> > perform a "list files" operation on the file system to get the current
> > view of the file system. In HDFS, listing all the files in the dataset
> > is a NameNode-intensive operation for large datasets. For example, one
> > of our HUDI datasets has thousands of date partitions, with each
> > partition having thousands of data files.
> >
> > With this effort, we want to:
> >
> > 1. Eliminate the requirement of the "list files" operation
> >    1. This will be done by proactively maintaining metadata about the
> >       list of files
> >    2. Reading the file list from a single file should be faster than a
> >       large number of NameNode operations
> > 2. Create column indexes for better query planning and faster lookups
> >    by readers
> >    1. For a column in the dataset, the min/max range per Parquet file
> >       can be maintained.
> >    2. Just by reading this index file, the query planning system
> >       should be able to get the view of potential Parquet files for a
> >       range query.
> >    3. Reading column information from an index file should be faster
> >       than reading the individual Parquet footers.
> >
> > This should provide the following benefits:
> >
> > 1. Reducing the number of file listing operations improves NameNode
> >    scalability and reduces NameNode burden.
> > 2. Query planning is optimized, as it is done by reading one metadata
> >    file and is mostly bounded regardless of the size of the dataset.
> > 3. Can allow for performing partition-path-agnostic queries in a
> >    performant way.
> >
> > We seek the Hudi development community's input on this proposal, to
> > explore this further and to implement a solution that is beneficial to
> > the Hudi community, meeting various use cases/requirements.
> >
> > https://cwiki.apache.org/confluence/display/HUDI/RFC+-+15%3A+HUDI+File+Listing+and+Query+Planning+Improvements
> >
> > Thanks,
> > Balajee, Prashant and Nishith
