+1 from me; the query improvements will indeed make Hudi more advanced.

Vinoth Chandar <[email protected]> wrote on Wed, Feb 19, 2020 at 3:17 AM:
> +1 on this as well. Also happy to collaborate on the RFC itself and help it
> make progress.
>
> >> For a column in the dataset, min/max range per Parquet file can be
> >> maintained
> This also (Nishith probably mentioned this) will help speed up the current
> bloom index's range checking.
>
> On Tue, Feb 18, 2020 at 5:40 AM vino yang <[email protected]> wrote:
>
> > Hi Balajee,
> >
> > Big +1 for the RFC, a good optimization mechanism.
> >
> > Best,
> > Vino
> >
> > [email protected] <[email protected]> wrote on Tue, Feb 18, 2020 at 1:27 PM:
> >
> > > Big +1 on the requirement. This would also help datasets using cloud
> > > storage by avoiding costly listings there. Will look closely at the
> > > design and implementation in the RFC to comment.
> > > Balaji.V
> > >
> > > On Monday, February 17, 2020, 02:06:59 PM PST, Balajee
> > > Nagasubramaniam <[email protected]> wrote:
> > >
> > > Abstract
> > >
> > > In the current implementation, the HUDI Writer Client (in the write
> > > path) and HUDI queries (through the InputFormat in the read path) have
> > > to perform a "list files" operation on the file system to get the
> > > current view of the file system. In HDFS, listing all the files in the
> > > dataset is a NameNode-intensive operation for large datasets. For
> > > example, one of our HUDI datasets has thousands of date partitions,
> > > with each partition having thousands of data files.
> > >
> > > With this effort, we want to:
> > >
> > > 1. Eliminate the requirement of the "list files" operation
> > >    1. This will be done by proactively maintaining metadata about the
> > >       list of files
> > >    2. Reading the file list from a single file should be faster than a
> > >       large number of NameNode operations
> > > 2. Create column indexes for better query planning and faster lookups
> > >    by readers
> > >    1. For a column in the dataset, the min/max range per Parquet file
> > >       can be maintained.
> > >    2. Just by reading this index file, the query planning system should
> > >       be able to get the view of potential Parquet files for a range
> > >       query.
> > >    3. Reading column information from an index file should be faster
> > >       than reading the individual Parquet footers.
> > >
> > > This should provide the following benefits:
> > >
> > > 1. Reducing the number of file listing operations improves NameNode
> > >    scalability and reduces the NameNode burden.
> > > 2. Query planning is optimized, since it is done by reading one
> > >    metadata file and is mostly bounded regardless of the size of the
> > >    dataset.
> > > 3. Partition-path-agnostic queries can be performed in a performant
> > >    way.
> > >
> > > We seek the Hudi development community's input on this proposal, to
> > > explore it further and to implement a solution that is beneficial to
> > > the Hudi community, meeting various use cases/requirements.
> > >
> > > https://cwiki.apache.org/confluence/display/HUDI/RFC+-+15%3A+HUDI+File+Listing+and+Query+Planning+Improvements
> > >
> > > Thanks,
> > > Balajee, Prashant and Nishith
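
To make the file-listing and column-index ideas above a bit more concrete, here is a rough sketch in plain Java (all class and field names are hypothetical illustrations, not the RFC's actual design or Hudi's API): the file list and per-file min/max values for an indexed column are modeled as a single in-memory structure that query planning consults to prune files, instead of listing partitions on the NameNode and reading individual Parquet footers.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FileListingIndexSketch {

  // Per-file column range, as it might be stored in the metadata/index file.
  static class FileColumnRange {
    final String fileName;
    final long minValue;
    final long maxValue;

    FileColumnRange(String fileName, long minValue, long maxValue) {
      this.fileName = fileName;
      this.minValue = minValue;
      this.maxValue = maxValue;
    }
  }

  // partition path -> data files in that partition; stands in for the
  // consolidated metadata file that replaces "list files" calls.
  private final Map<String, List<FileColumnRange>> partitionToFiles = new HashMap<>();

  void addFile(String partitionPath, FileColumnRange range) {
    partitionToFiles.computeIfAbsent(partitionPath, p -> new ArrayList<>()).add(range);
  }

  // Returns only the files whose [min, max] range overlaps the query range,
  // so the planner can skip files that cannot contain matching rows.
  List<String> candidateFiles(String partitionPath, long queryMin, long queryMax) {
    List<String> candidates = new ArrayList<>();
    for (FileColumnRange f : partitionToFiles.getOrDefault(partitionPath, List.of())) {
      boolean overlaps = f.minValue <= queryMax && f.maxValue >= queryMin;
      if (overlaps) {
        candidates.add(f.fileName);
      }
    }
    return candidates;
  }

  public static void main(String[] args) {
    FileListingIndexSketch index = new FileListingIndexSketch();
    index.addFile("2020/02/17", new FileColumnRange("file-1.parquet", 100, 200));
    index.addFile("2020/02/17", new FileColumnRange("file-2.parquet", 300, 400));
    // A range query on [150, 250] only needs file-1.parquet.
    System.out.println(index.candidateFiles("2020/02/17", 150, 250));
  }
}

In this sketch a single read of the index answers both questions from the abstract (which files exist, and which of them can match a range predicate); the actual on-disk layout and APIs are of course what the RFC itself will define.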
