+1 from me; the query improvements will indeed make Hudi more advanced.

Vinoth Chandar <[email protected]> wrote on Wed, Feb 19, 2020 at 3:17 AM:
> +1 on this as well. Also happy to collaborate on the RFC itself and help it
> make progress.
>
> >> For a column in the dataset, min/max range per Parquet file can be
> >> maintained
> This also (Nishith probably mentioned this) will help speed up the current
> bloom index's range checking.
>
> On Tue, Feb 18, 2020 at 5:40 AM vino yang <[email protected]> wrote:
>
> > Hi Balajee,
> >
> > Big +1 for the RFC, a good optimization mechanism.
> >
> > Best,
> > Vino
> >
> > [email protected] <[email protected]> wrote on Tue, Feb 18, 2020 at 1:27 PM:
> >
> > > Big +1 on the requirement. This would also help datasets using cloud
> > > storage by avoiding costly listings there. Will look closely at the
> > > design and implementation in the RFC to comment.
> > > Balaji.V
> > >
> > > On Monday, February 17, 2020, 02:06:59 PM PST, Balajee
> > > Nagasubramaniam <[email protected]> wrote:
> > >
> > > Abstract
> > >
> > > In the current implementation, the HUDI Writer Client (in the write
> > > path) and HUDI queries (through the InputFormat in the read path) have
> > > to perform a "list files" operation on the file system to get the
> > > current view of the file system. In HDFS, listing all the files in the
> > > dataset is a NameNode-intensive operation for large datasets. For
> > > example, one of our HUDI datasets has thousands of date partitions,
> > > with each partition having thousands of data files.
> > >
> > > With this effort, we want to:
> > >
> > > 1. Eliminate the requirement of the "list files" operation
> > >    1. This will be done by proactively maintaining metadata about the
> > >       list of files
> > >    2. Reading the file list from a single file should be faster than a
> > >       large number of NameNode operations
> > > 2. Create column indexes for better query planning and faster lookups
> > >    by readers
> > >    1. For a column in the dataset, the min/max range per Parquet file
> > >       can be maintained.
> > >    2. Just by reading this index file, the query planning system should
> > >       be able to get the view of potential Parquet files for a range
> > >       query.
> > >    3. Reading column information from an index file should be faster
> > >       than reading the individual Parquet footers.
> > >
> > > This should provide the following benefits:
> > >
> > > 1. Reducing the number of file listing operations improves NameNode
> > >    scalability and reduces the NameNode burden.
> > > 2. Query planning is optimized, since it is done by reading one
> > >    metadata file and is mostly bounded regardless of the size of the
> > >    dataset.
> > > 3. Partition-path-agnostic queries can be performed in a performant
> > >    way.
> > >
> > > We seek the Hudi development community's input on this proposal, to
> > > explore it further and to implement a solution that is beneficial to
> > > the Hudi community, meeting various use cases/requirements.
> > >
> > > https://cwiki.apache.org/confluence/display/HUDI/RFC+-+15%3A+HUDI+File+Listing+and+Query+Planning+Improvements
> > >
> > > Thanks,
> > > Balajee, Prashant and Nishith
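
To make the file-listing and column-index ideas above a bit more concrete, here is a rough sketch in plain Java (all class and field names are hypothetical illustrations, not the RFC's actual design or Hudi's API): the file list and per-file min/max values for an indexed column are modeled as a single in-memory structure that query planning consults to prune files, instead of listing partitions on the NameNode and reading individual Parquet footers.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FileListingIndexSketch {

  // Per-file column range, as it might be stored in the metadata/index file.
  static class FileColumnRange {
    final String fileName;
    final long minValue;
    final long maxValue;

    FileColumnRange(String fileName, long minValue, long maxValue) {
      this.fileName = fileName;
      this.minValue = minValue;
      this.maxValue = maxValue;
    }
  }

  // partition path -> data files in that partition; stands in for the
  // consolidated metadata file that replaces "list files" calls.
  private final Map<String, List<FileColumnRange>> partitionToFiles = new HashMap<>();

  void addFile(String partitionPath, FileColumnRange range) {
    partitionToFiles.computeIfAbsent(partitionPath, p -> new ArrayList<>()).add(range);
  }

  // Returns only the files whose [min, max] range overlaps the query range,
  // so the planner can skip files that cannot contain matching rows.
  List<String> candidateFiles(String partitionPath, long queryMin, long queryMax) {
    List<String> candidates = new ArrayList<>();
    for (FileColumnRange f : partitionToFiles.getOrDefault(partitionPath, List.of())) {
      boolean overlaps = f.minValue <= queryMax && f.maxValue >= queryMin;
      if (overlaps) {
        candidates.add(f.fileName);
      }
    }
    return candidates;
  }

  public static void main(String[] args) {
    FileListingIndexSketch index = new FileListingIndexSketch();
    index.addFile("2020/02/17", new FileColumnRange("file-1.parquet", 100, 200));
    index.addFile("2020/02/17", new FileColumnRange("file-2.parquet", 300, 400));
    // A range query on [150, 250] only needs file-1.parquet.
    System.out.println(index.candidateFiles("2020/02/17", 150, 250));
  }
}

In this sketch a single read of the index answers both questions from the abstract (which files exist, and which of them can match a range predicate); the actual on-disk layout and APIs are of course what the RFC itself will define.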
