Re: Discussion Thread: HUDI File Listing and Query Planning Improvements

vino yang Tue, 18 Feb 2020 05:41:07 -0800

Hi Balajee,

Big +1 for the RFC; this is a good optimization mechanism.


Best,
Vino

[email protected] <[email protected]> wrote on Tue, Feb 18, 2020 at 1:27 PM:

>
> Big +1 on the requirement. This would also help datasets on cloud
> storage by avoiding costly listings there. I will look closely at the
> design and implementation in the RFC before commenting.
>
> Balaji.V
>
> On Monday, February 17, 2020, 02:06:59 PM PST, Balajee
> Nagasubramaniam <[email protected]> wrote:
>
>  Abstract
>
> In the current implementation, the HUDI Writer Client (in the write path)
> and HUDI queries (through the InputFormat in the read path) have to perform
> a “list files” operation on the file system to get the current view of the
> file system. In HDFS, listing all the files in a dataset is a
> NameNode-intensive operation for large datasets. For example, one of our
> HUDI datasets has thousands of date partitions, with each partition having
> thousands of data files.
>
> With this effort, we want to:
>
>   1. Eliminate the need for the “list files” operation
>       1. This will be done by proactively maintaining metadata about the
>       list of files
>       2. Reading the file list from a single file should be faster than
>       a large number of NameNode operations
>   2. Create Column Indexes for better query planning and faster lookups by
>   Readers
>       1. For a column in the dataset, min/max range per Parquet file can be
>       maintained.
>       2. Just by reading this index file, the query planning system should
>       be able to get the view of potential Parquet files for a range query.
>       3. Reading Column information from an index file should be faster
>       than reading the individual Parquet Footers.
>
> This should provide the following benefits:
>
>   1. Reducing the number of file listing operations improves NameNode
>   scalability and reduces NameNode burden.
>   2. Query planning is faster because it reads a single metadata file, so
>   its cost stays mostly bounded regardless of the size of the dataset.
>   3. Partition-path-agnostic queries can be performed in a performant
>   way.
>
>
> We seek the Hudi development community's input on this proposal, so that
> we can explore it further and implement a solution that benefits the Hudi
> community and meets its various use cases and requirements.
>
>
> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+15%3A+HUDI+File+Listing+and+Query+Planning+Improvements
>
> Thanks,
> Balajee, Prashant and Nishith
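
For illustration only, and not taken from the RFC: a minimal Python sketch of the column index idea described in the abstract above, where one index read stands in for many per-file footer reads and per-file min/max ranges are used to prune files for a range query. The JSON layout, the `FileStats` type, the `load_index` and `candidate_files` names, and the sample paths and values are all assumptions made for this sketch; they are not Hudi APIs or the RFC's actual index format.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FileStats:
    """Min/max of one column in one data file (hypothetical index entry)."""
    path: str
    min_value: int
    max_value: int


def load_index(text: str) -> List[FileStats]:
    """Parse the whole index from a single JSON document.

    In this sketch, reading one index file replaces both the directory
    listing and the per-file footer reads.
    """
    return [FileStats(e["path"], e["min"], e["max"]) for e in json.loads(text)]


def candidate_files(index: List[FileStats], lo: int, hi: int) -> List[str]:
    """Return paths whose [min, max] range overlaps the query range [lo, hi]."""
    return [f.path for f in index if f.max_value >= lo and f.min_value <= hi]


# Hypothetical index contents; a real index would be read from storage.
INDEX_JSON = json.dumps([
    {"path": "2020/02/16/a.parquet", "min": 0, "max": 99},
    {"path": "2020/02/17/b.parquet", "min": 100, "max": 199},
    {"path": "2020/02/18/c.parquet", "min": 150, "max": 300},
])

index = load_index(INDEX_JSON)

# Range query 120 <= col <= 160: only files whose range overlaps are kept.
print(candidate_files(index, 120, 160))
```

Note that the pruning here is partition-path agnostic: the file under `2020/02/16` is skipped because of its value range, not because of its partition. A file kept by this check may still contain no matching rows, so the result is a set of candidates rather than an exact answer.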
