Re: [DISCUSS]: Integrate column stats index with all query engines
Surely we can work together once we get some feedback on the RFC, Meng!

On Thu, Aug 11, 2022 at 9:32 AM 1037817390 wrote:
> +1 for this.
> It will be better to provide some filter converters to facilitate the
> integration of the engines, e.g. a converter from Presto domains to Hudi
> domains.
>
> I have already finished a first version of data skipping / partition
> pruning / filter pushdown for Presto:
> https://github.com/xiarixiaoyao/presto/commit/800646608d4b88799de0addcddd97d03592954ce
> Maybe we can work together.
>
> Meng Tao
> mengtao0...@qq.com
>
> ------ Original message ------
> From: "dev" <vin...@apache.org>
> Sent: Thursday, August 11, 2022, 12:11 PM
> To: "dev"
> Subject: Re: [DISCUSS]: Integrate column stats index with all query engines
>
> +1 for this.
>
> Suggested new reviewers on the RFC.
> https://github.com/apache/hudi/pull/6345/files#r943073339
>
> On Wed, Aug 10, 2022 at 9:56 PM Pratyaksh Sharma wrote:
> > Hello community,
> >
> > With the introduction of the multi-modal index in Hudi, there is a lot
> > of scope for improvement on the query side. There are two major ways of
> > reducing the data scanned at query time: partition pruning and file
> > pruning. With the latest developments in the community, partition
> > pruning is supported for commonly used query engines like Spark, Presto
> > and Hive, while file pruning using the column stats index is only
> > supported for Spark and Flink.
> >
> > We intend to support data skipping for the rest of the engines as well,
> > which include Hive, Presto and Trino. I have written a draft RFC here:
> > https://github.com/apache/hudi/pull/6345.
> >
> > Please take a look and let me know what you think. Once we have some
> > feedback from the community, we can decide on the next steps.
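The filter-converter idea suggested in this thread (mapping an engine's pushed-down predicate domain onto Hudi's column stats so files can be skipped) can be sketched roughly as below. This is a minimal illustration, not Hudi or Presto code: every class and function name here is hypothetical, and real Presto TupleDomains and Hudi's column stats index have much richer types (null handling, multiple ranges, arbitrary column types).

```python
# Illustrative sketch of a predicate-domain converter for data skipping.
# All names here are hypothetical, not actual Hudi or Presto APIs.
from dataclasses import dataclass

@dataclass
class Range:
    """A closed interval [low, high] on one column, as an engine might push down."""
    column: str
    low: float
    high: float

@dataclass
class ColumnStats:
    """Per-file min/max statistics, as a column stats index might store them."""
    file_name: str
    column: str
    min_value: float
    max_value: float

def file_may_match(predicate: Range, stats: ColumnStats) -> bool:
    """A file can be skipped only when its [min, max] range is provably
    disjoint from the predicate's range; otherwise it must be scanned."""
    if predicate.column != stats.column:
        return True  # stats do not cover this predicate's column; cannot skip
    return not (stats.max_value < predicate.low or stats.min_value > predicate.high)

def prune_files(predicate: Range, all_stats: list) -> list:
    """Return only the file names whose stats overlap the predicate."""
    return [s.file_name for s in all_stats if file_may_match(predicate, s)]
```

Note the conservative direction of the logic: when in doubt (no stats for the predicate column), the file is kept, since skipping a file that might match would silently drop rows.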
Re: [DISCUSS]: Integrate column stats index with all query engines
+1 for this. It will be better to provide some filter converters to facilitate the integration of the engines, e.g. a converter from Presto domains to Hudi domains.

I have already finished a first version of data skipping / partition pruning / filter pushdown for Presto: https://github.com/xiarixiaoyao/presto/commit/800646608d4b88799de0addcddd97d03592954ce
Maybe we can work together.

Meng Tao
mengtao0...@qq.com
Re: 0.12.0 Release Timeline
Hello. Any updates on RC2? :)

On Sat, Aug 6, 2022 at 10:36 AM sagar sumit wrote:
> Hi folks,
>
> Thanks for voting on RC1.
> I will be preparing RC2 by Monday, 8th August end of day PST,
> and I will send out a separate voting email for RC2.
>
> Regards,
> Sagar
>
> On Fri, Jul 29, 2022 at 6:08 PM sagar sumit wrote:
> > We can now resume merging to master branch.
> > Thanks for your patience.
> >
> > Regards,
> > Sagar
Re: [DISCUSS]: Integrate column stats index with all query engines
+1 for this.

Suggested new reviewers on the RFC.
https://github.com/apache/hudi/pull/6345/files#r943073339

On Wed, Aug 10, 2022 at 9:56 PM Pratyaksh Sharma wrote:
> Hello community,
>
> With the introduction of the multi-modal index in Hudi, there is a lot of
> scope for improvement on the query side. There are two major ways of
> reducing the data scanned at query time: partition pruning and file
> pruning. With the latest developments in the community, partition pruning
> is supported for commonly used query engines like Spark, Presto and Hive,
> while file pruning using the column stats index is only supported for
> Spark and Flink.
>
> We intend to support data skipping for the rest of the engines as well,
> which include Hive, Presto and Trino. I have written a draft RFC here:
> https://github.com/apache/hudi/pull/6345.
>
> Please take a look and let me know what you think. Once we have some
> feedback from the community, we can decide on the next steps.
[DISCUSS]: Integrate column stats index with all query engines
Hello community,

With the introduction of the multi-modal index in Hudi, there is a lot of scope for improvement on the query side. There are two major ways of reducing the data scanned at query time: partition pruning and file pruning. With the latest developments in the community, partition pruning is supported for commonly used query engines like Spark, Presto and Hive, while file pruning using the column stats index is only supported for Spark and Flink.

We intend to support data skipping for the rest of the engines as well, which include Hive, Presto and Trino. I have written a draft RFC here: https://github.com/apache/hudi/pull/6345.

Please take a look and let me know what you think. Once we have some feedback from the community, we can decide on the next steps.
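The two pruning layers described above can be contrasted with a small sketch. This is illustrative only, not Hudi's implementation: the table layout, file names and the `query` helper are all hypothetical. Partition pruning discards whole partitions using the partition-column value; file pruning then discards individual files within the surviving partitions using per-file min/max column stats.

```python
# Illustrative sketch (not Hudi code) of the two pruning layers.
# Hypothetical table layout: partition path -> list of (file, min_ts, max_ts),
# where min_ts/max_ts stand in for per-file column stats on a 'ts' column.
table = {
    "dt=2022-08-09": [("a.parquet", 100, 200), ("b.parquet", 250, 400)],
    "dt=2022-08-10": [("c.parquet", 150, 300)],
    "dt=2022-08-11": [("d.parquet", 500, 900)],
}

def query(partition_filter, ts_low, ts_high):
    """Files to scan for: WHERE <partition_filter> AND ts BETWEEN ts_low AND ts_high."""
    files = []
    for partition, file_stats in table.items():
        if not partition_filter(partition):       # layer 1: partition pruning
            continue
        for name, min_ts, max_ts in file_stats:   # layer 2: file pruning (column stats)
            if max_ts < ts_low or min_ts > ts_high:
                continue                          # stats prove no row can match
            files.append(name)
    return files
```

With only partition pruning, every file in a matching partition must be scanned; the column-stats layer is what lets an engine skip files like `d.parquet` above when its min timestamp lies beyond the queried range.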
Re: [new RFC Request] The need of Multiple event_time fields verification
Just saw the PR has been approved. Thanks a lot for your time! We will submit the RFC materials as soon as possible (within a few days, to our best effort). Looking forward to receiving your further feedback at that time. Wish you a good day :) Xinyao On 08/10/2022 13:44, Sivabalan wrote: Sure. Approved and landed! On Tue, 9 Aug 2022 at 18:55, Xinyao Tian (田昕峣) wrote: Hi Sivabalan, Thanks for your kind words! We have been working very hard to prepare materials for the RFC this week since we got your feedback about our idea, and I promise it will be very soon (within a few days) that everyone can read our RFC and learn every detail about this feature. It's our pleasure to make Hudi even more powerful by making this feature available to everyone. However, there is one thing that we really need your help with. According to the RFC process shown in the Hudi docs, we have to first raise a PR and add an entry to rfc/README.md. But since this is the first time we have raised a PR to Hudi, it is necessary to have a maintainer with write permission approve our PR. We have been waiting for days but the PR is still pending. Therefore, may I ask you to help us approve our first PR so that we can submit our further materials to Hudi? The URL of our pending PR is https://github.com/apache/hudi/pull/6328 and the corresponding Jira is https://issues.apache.org/jira/browse/HUDI-4569. Appreciate your help so much :) Kind regards, Xinyao Tian On 08/9/2022 21:46, Sivabalan wrote: Eagerly looking forward to the RFC, Xinyao. Definitely see a lot of folks benefiting from this. On Sun, 7 Aug 2022 at 20:00, Xinyao Tian (田昕峣) wrote: Hi Shiyan, Thanks so much for your feedback as well as your kind encouragement! It's always our honor to contribute our effort to everyone and make Hudi even more awesome :) We are now carefully preparing materials for the new RFC.
Once we finish, we will strictly follow the RFC process shown in the Hudi official documentation to propose the new RFC and share all details of the new feature, as well as the related code, with everyone. Since we benefit from the Hudi community, we would like to give our effort back to the community and make Hudi benefit more people! As always, please stay healthy and keep safe. Kind regards, Xinyao Tian On 08/6/2022 10:11, Shiyan Xu wrote: Hi Xinyao, awesome achievement! And we really appreciate your keenness in contributing to Hudi. Certainly we'd love to see an RFC for this. On Fri, Aug 5, 2022 at 4:21 AM Xinyao Tian (田昕峣) wrote: Greetings everyone, My name is Xinyao and I'm currently working for an insurance company. We found that Apache Hudi is an extremely awesome utility, and when it cooperates with Apache Flink it can be even more powerful. Thus, we have been using it for months and keep benefiting from it. However, there is one feature that we really desire but Hudi doesn't currently have: it is called "multiple event_time fields verification". In the insurance industry, data is often distributed across dozens of tables that are conceptually connected by the same primary keys. When the data is used, we often need to associate several or even dozens of tables through join operations, and stitch all the partial columns into an entire record with dozens or even hundreds of columns for downstream services to use. Here comes the problem: if we want to guarantee that every part of the joined data is up to date, Hudi must be able to filter on multiple event_time timestamps in a table and keep the most recent records. In this scenario, the single event_time filtering field provided by Hudi (i.e. the option 'write.precombine.field' in Hudi 0.10.0) is a bit inadequate.
Obviously, in order to cope with use cases with complex join operations like the above, as well as to give Hudi the potential to support more application scenarios and enter more industries, Hudi definitely needs to support filtering on multiple event_time timestamps in a single table. The good news is that, after more than two months of development, my colleagues and I have made some changes in the hudi-flink and hudi-common modules based on Hudi 0.10.0 and have basically achieved this feature. Currently, my team is using the enhanced source code with Kafka and Flink 1.13.2 to conduct end-to-end testing on a dataset of more than 140 million real-world insurance records and to verify the accuracy of the data. The results are quite good: every part of the extremely wide records has been updated to the latest status, based on our continuous observations during these weeks. We're very keen to make this new feature available to everyone. We benefit from the Hudi community, so we really want to give back to the community with our efforts. The only problem is that we are not sure whether we need to create an RFC to illustrate our design and implementation in detail. According to "RFC Process" in Hudi
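The multiple event_time idea described in this thread can be sketched as below. This is a minimal illustration under assumptions, not the proposers' actual hudi-flink/hudi-common changes: records are plain dicts, and a hypothetical `event_time_fields` mapping ties each event_time column to the data columns it governs, generalizing the single `write.precombine.field`. A partial update from one source table should only overwrite the column group whose event_time it carries, and only when that event_time is newer.

```python
# Illustrative sketch (not the actual Hudi implementation) of merging a
# wide record stitched from several source tables, each with its own
# event_time column. The merge function and field mapping are hypothetical.

def merge(existing: dict, incoming: dict, event_time_fields: dict) -> dict:
    """event_time_fields maps each event_time column to the list of data
    columns it governs. For each group present in the incoming record,
    keep whichever side carries the newer event_time, so a stale partial
    update can never overwrite fresher data in another column group."""
    merged = dict(existing)
    for ts_field, columns in event_time_fields.items():
        if ts_field not in incoming:
            continue  # this update does not touch that column group
        if incoming[ts_field] >= existing.get(ts_field, float("-inf")):
            merged[ts_field] = incoming[ts_field]
            for col in columns:
                merged[col] = incoming.get(col, merged.get(col))
    return merged
```

For example, an incoming record with a newer policy-side timestamp but an older claim-side timestamp would update only the policy columns, leaving the claim columns at their latest known state, which is exactly the guarantee the single precombine field cannot provide for joined, multi-source records.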