+1 on this as well. Also happy to collaborate on the RFC itself and help it make progress.
>> For a column in the dataset, min/max range per Parquet file can be
>> maintained

This (as Nishith probably mentioned) will also help speed up the current
bloom index's range checking.

On Tue, Feb 18, 2020 at 5:40 AM vino yang <[email protected]> wrote:

> Hi Balajee,
>
> Big +1 for the RFC, good optimization mechanism.
>
> Best,
> Vino
>
> [email protected] <[email protected]> wrote on Tue, Feb 18, 2020 at 1:27 PM:
>
> > Big +1 on the requirement. This would also help datasets using cloud
> > storage by avoiding costly listings there. Will look closely at the
> > design and implementation in the RFC to comment.
> >
> > Balaji.V
> >
> > On Monday, February 17, 2020, 02:06:59 PM PST, Balajee
> > Nagasubramaniam <[email protected]> wrote:
> >
> > Abstract
> >
> > In the current implementation, the HUDI Writer Client (in the write
> > path) and HUDI queries (through InputFormat in the read path) have to
> > perform a "list files" operation on the file system to get the current
> > view of the file system. In HDFS, listing all the files in the dataset
> > is a NameNode-intensive operation for large datasets. For example, one
> > of our HUDI datasets has thousands of date partitions, with each
> > partition having thousands of data files.
> >
> > With this effort, we want to:
> >
> > 1. Eliminate the requirement of the "list files" operation
> >    1. This will be done by proactively maintaining metadata about the
> >       list of files
> >    2. Reading the file list from a single file should be faster than a
> >       large number of NameNode operations
> > 2. Create column indexes for better query planning and faster lookups
> >    by readers
> >    1. For a column in the dataset, the min/max range per Parquet file
> >       can be maintained.
> >    2. Just by reading this index file, the query planning system
> >       should be able to get the view of potential Parquet files for a
> >       range query.
> >    3. Reading column information from an index file should be faster
> >       than reading the individual Parquet footers.
> >
> > This should provide the following benefits:
> >
> > 1. Reducing the number of file listing operations improves NameNode
> >    scalability and reduces NameNode burden.
> > 2. Query planning is optimized, as it is done by reading one metadata
> >    file and is mostly bounded regardless of the size of the dataset.
> > 3. Can allow for performing partition-path-agnostic queries in a
> >    performant way.
> >
> > We seek the Hudi development community's input on this proposal, to
> > explore this further and to implement a solution that is beneficial to
> > the Hudi community, meeting various use cases/requirements.
> >
> > https://cwiki.apache.org/confluence/display/HUDI/RFC+-+15%3A+HUDI+File+Listing+and+Query+Planning+Improvements
> >
> > Thanks,
> > Balajee, Prashant and Nishith
