Re: [DISCUSS]: Integrate column stats index with all query engines
Surely we can work together once we get some feedback on the RFC, Meng!

On Thu, Aug 11, 2022 at 9:32 AM 1037817390 wrote:
> +1 for this.
> It will be better to provide some filter converters to facilitate the
> integration of the engines, e.g. a converter from Presto domains to Hudi
> domains.
>
> I have already finished a first version of data skipping / partition
> pruning / filter pushdown for Presto:
> https://github.com/xiarixiaoyao/presto/commit/800646608d4b88799de0addcddd97d03592954ce
> Maybe we can work together.
>
> Meng Tao
> mengtao0...@qq.com
>
> ------ Original message ------
> From: "dev" <vin...@apache.org>
> Sent: Thursday, August 11, 2022, 12:11 PM
> To: "dev"
> Subject: Re: [DISCUSS]: Integrate column stats index with all query engines
>
> +1 for this.
>
> Suggested new reviewers on the RFC.
> https://github.com/apache/hudi/pull/6345/files#r943073339
>
> On Wed, Aug 10, 2022 at 9:56 PM Pratyaksh Sharma wrote:
> > Hello community,
> >
> > With the introduction of the multi-modal index in Hudi, there is a lot
> > of scope for improvement on the query side. There are two major ways of
> > reducing the data scanned at query time: partition pruning and file
> > pruning. With the latest developments in the community, partition
> > pruning is supported for commonly used query engines like Spark, Presto
> > and Hive, while file pruning using the column stats index is only
> > supported for Spark and Flink.
> >
> > We intend to support data skipping for the rest of the engines as well,
> > which include Hive, Presto and Trino. I have written a draft RFC here:
> > https://github.com/apache/hudi/pull/6345.
> >
> > Please take a look and let me know what you think. Once we have some
> > feedback from the community, we can decide on the next steps.
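The filter-converter idea suggested in this thread (mapping an engine's pushed-down predicate domain onto Hudi's column stats so files can be skipped) can be sketched roughly as below. This is a minimal illustration, not Hudi or Presto code: every class and function name here is hypothetical, and real Presto TupleDomains and Hudi's column stats index have much richer types (null handling, multiple ranges, arbitrary column types).

```python
# Illustrative sketch of a predicate-domain converter for data skipping.
# All names here are hypothetical, not actual Hudi or Presto APIs.
from dataclasses import dataclass

@dataclass
class Range:
    """A closed interval [low, high] on one column, as an engine might push down."""
    column: str
    low: float
    high: float

@dataclass
class ColumnStats:
    """Per-file min/max statistics, as a column stats index might store them."""
    file_name: str
    column: str
    min_value: float
    max_value: float

def file_may_match(predicate: Range, stats: ColumnStats) -> bool:
    """A file can be skipped only when its [min, max] range is provably
    disjoint from the predicate's range; otherwise it must be scanned."""
    if predicate.column != stats.column:
        return True  # stats do not cover this predicate's column; cannot skip
    return not (stats.max_value < predicate.low or stats.min_value > predicate.high)

def prune_files(predicate: Range, all_stats: list) -> list:
    """Return only the file names whose stats overlap the predicate."""
    return [s.file_name for s in all_stats if file_may_match(predicate, s)]
```

Note the conservative direction of the logic: when in doubt (no stats for the predicate column), the file is kept, since skipping a file that might match would silently drop rows.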
Re: [DISCUSS]: Integrate column stats index with all query engines
+1 for this. It will be better to provide some filter converters to facilitate the integration of the engines, e.g. a converter from Presto domains to Hudi domains.

I have already finished a first version of data skipping / partition pruning / filter pushdown for Presto: https://github.com/xiarixiaoyao/presto/commit/800646608d4b88799de0addcddd97d03592954ce
Maybe we can work together.

Meng Tao
mengtao0...@qq.com
Re: 0.12.0 Release Timeline
Hello. Any updates on RC2? :)

On Sat, Aug 6, 2022 at 10:36 AM sagar sumit wrote:
> Hi folks,
>
> Thanks for voting on RC1.
> I will be preparing RC2 by Monday, 8th August end of day PST,
> and I will send out a separate voting email for RC2.
>
> Regards,
> Sagar
>
> On Fri, Jul 29, 2022 at 6:08 PM sagar sumit wrote:
> > We can now resume merging to master branch.
> > Thanks for your patience.
> >
> > Regards,
> > Sagar
Re: [DISCUSS]: Integrate column stats index with all query engines
+1 for this.

Suggested new reviewers on the RFC.
https://github.com/apache/hudi/pull/6345/files#r943073339

On Wed, Aug 10, 2022 at 9:56 PM Pratyaksh Sharma wrote:
> Hello community,
>
> With the introduction of the multi-modal index in Hudi, there is a lot of
> scope for improvement on the query side. There are two major ways of
> reducing the data scanned at query time: partition pruning and file
> pruning. With the latest developments in the community, partition pruning
> is supported for commonly used query engines like Spark, Presto and Hive,
> while file pruning using the column stats index is only supported for
> Spark and Flink.
>
> We intend to support data skipping for the rest of the engines as well,
> which include Hive, Presto and Trino. I have written a draft RFC here:
> https://github.com/apache/hudi/pull/6345.
>
> Please take a look and let me know what you think. Once we have some
> feedback from the community, we can decide on the next steps.
[DISCUSS]: Integrate column stats index with all query engines
Hello community,

With the introduction of the multi-modal index in Hudi, there is a lot of scope for improvement on the query side. There are two major ways of reducing the data scanned at query time: partition pruning and file pruning. With the latest developments in the community, partition pruning is supported for commonly used query engines like Spark, Presto and Hive, while file pruning using the column stats index is only supported for Spark and Flink.

We intend to support data skipping for the rest of the engines as well, which include Hive, Presto and Trino. I have written a draft RFC here: https://github.com/apache/hudi/pull/6345.

Please take a look and let me know what you think. Once we have some feedback from the community, we can decide on the next steps.
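The two pruning layers described above can be contrasted with a small sketch. This is illustrative only, not Hudi's implementation: the table layout, file names and the `query` helper are all hypothetical. Partition pruning discards whole partitions using the partition-column value; file pruning then discards individual files within the surviving partitions using per-file min/max column stats.

```python
# Illustrative sketch (not Hudi code) of the two pruning layers.
# Hypothetical table layout: partition path -> list of (file, min_ts, max_ts),
# where min_ts/max_ts stand in for per-file column stats on a 'ts' column.
table = {
    "dt=2022-08-09": [("a.parquet", 100, 200), ("b.parquet", 250, 400)],
    "dt=2022-08-10": [("c.parquet", 150, 300)],
    "dt=2022-08-11": [("d.parquet", 500, 900)],
}

def query(partition_filter, ts_low, ts_high):
    """Files to scan for: WHERE <partition_filter> AND ts BETWEEN ts_low AND ts_high."""
    files = []
    for partition, file_stats in table.items():
        if not partition_filter(partition):       # layer 1: partition pruning
            continue
        for name, min_ts, max_ts in file_stats:   # layer 2: file pruning (column stats)
            if max_ts < ts_low or min_ts > ts_high:
                continue                          # stats prove no row can match
            files.append(name)
    return files
```

With only partition pruning, every file in a matching partition must be scanned; the column-stats layer is what lets an engine skip files like `d.parquet` above when its min timestamp lies beyond the queried range.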
Re: [new RFC Request] The need of Multiple event_time fields verification
Just saw the PR has been approved. Thanks a lot for your time! We will submit the RFC materials as soon as possible (within a few days, to our best effort). Looking forward to receiving your further feedback at that time. Wish you a good day :) Xinyao On 08/10/2022 13:44, Sivabalan wrote: Sure. Approved and landed! On Tue, 9 Aug 2022 at 18:55, Xinyao Tian (田昕峣) wrote: Hi Sivabalan, Thanks for your kind words! We have been working very hard to prepare materials for the RFC this week since we got your feedback about our idea, and I promise it will be very soon (within a few days) that everyone can read our RFC and learn every detail about this feature. It's our pleasure to make Hudi even more powerful by making this feature available to everyone. However, there is one thing that we really need your help with. According to the RFC process shown in the Hudi docs, we have to first raise a PR and add an entry to rfc/README.md. But since this is the first time we have raised a PR to Hudi, it is necessary to have a maintainer with write permission approve our PR. We have been waiting for days but the PR is still pending. Therefore, may I ask you to help us approve our first PR so that we can submit our further materials to Hudi? The URL of our pending PR is https://github.com/apache/hudi/pull/6328 and the corresponding Jira is https://issues.apache.org/jira/browse/HUDI-4569. Appreciate your help so much :) Kind regards, Xinyao Tian On 08/9/2022 21:46, Sivabalan wrote: Eagerly looking forward to the RFC, Xinyao. Definitely see a lot of folks benefiting from this. On Sun, 7 Aug 2022 at 20:00, Xinyao Tian (田昕峣) wrote: Hi Shiyan, Thanks so much for your feedback as well as your kind encouragement! It's always our honor to contribute our effort to everyone and make Hudi even more awesome :) We are now carefully preparing materials for the new RFC.
Once we finish, we will strictly follow the RFC process shown in the Hudi official documentation to propose the new RFC and share all details of the new feature, as well as the related code, with everyone. Since we benefit from the Hudi community, we would like to give our effort back to the community and make Hudi benefit more people! As always, please stay healthy and keep safe. Kind regards, Xinyao Tian On 08/6/2022 10:11, Shiyan Xu wrote: Hi Xinyao, awesome achievement! And we really appreciate your keenness in contributing to Hudi. Certainly we'd love to see an RFC for this. On Fri, Aug 5, 2022 at 4:21 AM Xinyao Tian (田昕峣) wrote: Greetings everyone, My name is Xinyao and I'm currently working for an insurance company. We found that Apache Hudi is an extremely awesome utility, and when it cooperates with Apache Flink it can be even more powerful. Thus, we have been using it for months and keep benefiting from it. However, there is one feature that we really desire but Hudi doesn't currently have: it is called "multiple event_time fields verification". In the insurance industry, data is often distributed across dozens of tables that are conceptually connected by the same primary keys. When the data is used, we often need to associate several or even dozens of tables through join operations, and stitch all the partial columns into an entire record with dozens or even hundreds of columns for downstream services to use. Here comes the problem: if we want to guarantee that every part of the joined data is up to date, Hudi must be able to filter on multiple event_time timestamps in a table and keep the most recent records. In this scenario, the single event_time filtering field provided by Hudi (i.e. the option 'write.precombine.field' in Hudi 0.10.0) is a bit inadequate.
Obviously, in order to cope with use cases with complex join operations like the above, as well as to give Hudi the potential to support more application scenarios and enter more industries, Hudi definitely needs to support filtering on multiple event_time timestamps in a single table. The good news is that, after more than two months of development, my colleagues and I have made some changes in the hudi-flink and hudi-common modules based on Hudi 0.10.0 and have basically achieved this feature. Currently, my team is using the enhanced source code with Kafka and Flink 1.13.2 to conduct end-to-end testing on a dataset of more than 140 million real-world insurance records and to verify the accuracy of the data. The results are quite good: every part of the extremely wide records has been updated to the latest status, based on our continuous observations during these weeks. We're very keen to make this new feature available to everyone. We benefit from the Hudi community, so we really want to give back to the community with our efforts. The only problem is that we are not sure whether we need to create an RFC to illustrate our design and implementation in detail. According to "RFC Process" in Hudi
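The multiple event_time idea described in this thread can be sketched as below. This is a minimal illustration under assumptions, not the proposers' actual hudi-flink/hudi-common changes: records are plain dicts, and a hypothetical `event_time_fields` mapping ties each event_time column to the data columns it governs, generalizing the single `write.precombine.field`. A partial update from one source table should only overwrite the column group whose event_time it carries, and only when that event_time is newer.

```python
# Illustrative sketch (not the actual Hudi implementation) of merging a
# wide record stitched from several source tables, each with its own
# event_time column. The merge function and field mapping are hypothetical.

def merge(existing: dict, incoming: dict, event_time_fields: dict) -> dict:
    """event_time_fields maps each event_time column to the list of data
    columns it governs. For each group present in the incoming record,
    keep whichever side carries the newer event_time, so a stale partial
    update can never overwrite fresher data in another column group."""
    merged = dict(existing)
    for ts_field, columns in event_time_fields.items():
        if ts_field not in incoming:
            continue  # this update does not touch that column group
        if incoming[ts_field] >= existing.get(ts_field, float("-inf")):
            merged[ts_field] = incoming[ts_field]
            for col in columns:
                merged[col] = incoming.get(col, merged.get(col))
    return merged
```

For example, an incoming record with a newer policy-side timestamp but an older claim-side timestamp would update only the policy columns, leaving the claim columns at their latest known state, which is exactly the guarantee the single precombine field cannot provide for joined, multi-source records.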