Re: the best approach to contribute lots of safe non-breaking micro-fixes to clean up the whole code base

2024-02-27 Thread Vinoth Chandar
Thanks for starting this.

I think refactoring without clear end goals could cause a bunch of
thrashing. There are some active code restructuring efforts (see the
storage abstraction, file group reader, etc.). Those driving them can speak
up if they need some help there.

But that said, there are plenty of places. We have some of this filed away
under the "code-quality" component. Could you file some concrete JIRAs there
that we can then review/help prioritize?

For starters, removing build warnings can be a good, concrete starting
point.

Thanks
Vinoth

On Sat, Feb 24, 2024 at 3:39 AM wombatu-kun o_0 
wrote:

> Hi!
> I want to participate in Hudi development more actively and I need your
> advice on how to start.
> At the moment I'm not familiar enough with Hudi to fix complex bugs,
> develop outstanding new features or make global optimizations by
> myself. I take only simple tickets from JIRA, or create my own and solve
> them.
> While solving these easy tasks I come across lots of little smelly things
> in the code (such as typos, concatenation in logging, useless
> variables/fields/arguments/exceptions, missing annotations, raw type usage,
> etc.) that I would like to fix immediately. But mixing the implementation
> of the target JIRA task with such refactoring in a single PR is not
> welcomed by the community.
> I would also like to understand the code and Hudi functionality better, to
> be able to make more serious contributions in the future.
> And while figuring out the Hudi codebase I want not just to get a better
> understanding for myself, but also to do something useful for the project.
>
> So, my intention is to get to know Hudi globally, starting with
> micro-refactoring. And my plan is:
> 1. Start simple: attentively review the whole code base and methodically
> make lots of trivial cosmetic micro-fixes that make the code cleaner
> (examples of improvements are listed above);
> 2. While doing p.1, note places in the code (methods, classes, families)
> that need more complex refactoring or should be optimized;
> 3. Make the refactorings/optimizations from p.2, creating for each case
> its own JIRA task and PR (or a MINOR PR if there are not many changes);
> ...
> PROFIT!!
> Regarding p.1, I have some questions for you:
> Do we need such a cleanup?
> If yes, what is the best approach to contribute lots of safe, non-breaking
> micro-fixes to clean up the code? I mean, how to divide such changes into
> JIRA tasks and PRs.
> What is an acceptable number of files changed by a single MINOR PR?
> If I make 1 PR per module, then even a middle-sized module will produce so
> many diffs that reviewers won't like it at all (and will never approve
> it). If I instead split the changes across multiple PRs, there will be too
> many trivial PRs, producing extra load on CI.
>
> Patiently waiting for your advice.
>


Re: Invitation to contribute to OneTable

2023-12-05 Thread Vinoth Chandar
Thanks for including the Hudi community!

I am happy to participate, and this project helps us expose Hudi data to
all the available engines out there. So excited for that.

On Mon, Dec 4, 2023 at 1:28 PM Jesus Camacho Rodriguez 
wrote:

> Hi All,
>
> We are reaching out regarding a new project in the table formats space -
> OneTable.
>
> OneTable[1] is an omni-directional converter for table formats that
> facilitates interoperability across data processing systems and query
> engines. Currently, OneTable supports widely adopted open-source table
> formats such as Apache Hudi, Apache Iceberg, and Delta Lake.
>
> We have submitted a proposal to incubate OneTable in the ASF that can be
> found here:
> https://cwiki.apache.org/confluence/display/INCUBATOR/OneTable+Proposal
>
> One important attribute for OneTable's success is to ensure it is equally
> effective across all table formats without creating distinctions between
> them. We recognize that in-depth knowledge about each format primarily
> resides within the respective communities that develop them. Thus, as part
> of our incubation discussions, we are reaching out to the developer mailing
> lists of the different table format communities and we invite you to also
> participate in OneTable's community if you are interested. To learn about
> the project roadmap, ask questions, contribute code, discuss technical
> issues, please see the OneTable Github repository here:
> https://github.com/onetable-io/onetable
>
> To share comments on the incubation proposal to ASF, please see this
> discussion:
> https://lists.apache.org/thread/rx9z8ffrf37qjhpkf1vp5rqg5lhht7jm
>
> Thanks,
> Jesús (on behalf of the OneTable community)
>
> [1] https://github.com/onetable-io/onetable
>


Re: [VOTE] Release 1.0.0-beta1, release candidate #1

2023-11-13 Thread Vinoth Chandar
+1 (binding)

On Sun, Nov 12, 2023 at 10:07 PM Y Ethan Guo  wrote:

> +1 (binding)
>
> - Source, bundle validation pass
> - Ran Spark Quickstart (Datasource in Scala, SQL) on Spark 3.3
> - Ran long-running Hudi streamer jobs writing COW and MOR tables
>
> On Sat, Nov 11, 2023 at 12:24 AM sagar sumit  wrote:
>
> > Hi everyone,
> >
> > Please review and vote on the release candidate #1 for the version
> > 1.0.0-beta1,
> > as follows:
> >
> > [ ] +1, Approve the release
> >
> > [ ] -1, Do not approve the release (please provide specific comments)
> >
> >
> >
> > The complete staging area is available for your review, which includes:
> >
> > * JIRA release notes [1],
> >
> > * the official Apache source release and binary convenience releases to
> be
> > deployed to dist.apache.org [2], which are signed with the key with
> > fingerprint 888A9341E600EB8550AACD5EFB1B7504F7F770C9 [3],
> >
> > * all artifacts to be deployed to the Maven Central Repository [4],
> >
> > * source code tag "1.0.0-beta-rc1" [5]
> >
> >
> >
> > The vote will be open for at least 72 hours. It is adopted by majority
> > approval, with at least 3 PMC affirmative votes.
> >
> >
> >
> > Thanks,
> > Sagar Sumit
> > Release Manager
> >
> >
> > [1]
> >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822&version=12351210
> >
> > [2] https://dist.apache.org/repos/dist/dev/hudi/hudi-1.0.0-beta1-rc1
> >
> > [3] https://dist.apache.org/repos/dist/release/hudi/KEYS
> >
> > [4]
> https://repository.apache.org/content/repositories/orgapachehudi-1129
> >
> > [5] https://github.com/apache/hudi/releases/tag/release-1.0.0-beta1-rc1
> >
>


Request for review : RFC-66 / Non Blocking Concurrency Control

2023-09-28 Thread Vinoth Chandar
Hi all,

We have some promising results and a more finalized approach for newer
concurrency control for Hudi 1.0.
Please help review this rfc https://github.com/apache/hudi/pull/7907

Thanks
Vinoth


Re: Calling for 0.12.4 release

2023-09-22 Thread Vinoth Chandar
+1 thanks Yue!

On Thu, Sep 21, 2023 at 18:19 Danny Chan  wrote:

> Thanks Yue Zhang for the contribution ~
>
> Best,
> Danny
>
> > Y Ethan Guo  wrote on Sat, Sep 2, 2023 at 00:24:
> >
> > Thanks, Yue Zhang, for volunteering to be the RM!
> >
> > On Thu, Aug 31, 2023 at 4:38 PM Yue Zhang 
> wrote:
> >
> > > Hi Hudiers,
> > > I volunteer to be the RM for the next 0.12.4 if you don’t mind
> > > YueZhang
> > >  Replied Message 
> > > | From | Y Ethan Guo |
> > > | Date | 09/01/2023 07:34 |
> > > | To | dev |
> > > | Subject | Calling for 0.12.4 release |
> > > Hi folks,
> > >
> > > It's been 4+ months since Hudi 0.12.3 was released.  As we want to
> maintain
> > > 0.12.x LTS releases, shall we, as a community, follow up with 0.12.4
> > > release to pick up recent bug fixes and improvements?  Any volunteer
> for
> > > 0.12.4 Release Manager is welcome.
> > >
> > > Thanks,
> > > - Ethan
> > >
>


Re: [VOTE] Release 0.14.0, release candidate #2

2023-09-14 Thread Vinoth Chandar
For all, link [2] should be
https://dist.apache.org/repos/dist/dev/hudi/hudi-0.14.0-rc2/

On Wed, Sep 13, 2023 at 11:53 AM Prashant Wason 
wrote:

> Hi everyone,
>
> Please review and vote on the *release candidate #2* for the version
> 0.14.0, as follows:
>
> [ ] +1, Approve the release
>
> [ ] -1, Do not approve the release (please provide specific comments)
>
>
>
> The complete staging area is available for your review, which includes:
>
> * JIRA release notes [1],
>
> * the official Apache source release and binary convenience releases to be
> deployed to dist.apache.org [2], which are signed with the key with
> fingerprint 75C5744E9E5CD5C48E19C082C4D858D73B9DB1B8 [3],
>
> * all artifacts to be deployed to the Maven Central Repository [4],
>
> * source code tag "0.14.0-rc2" [5],
>
>
>
> The vote will be open for at least 72 hours. It is adopted by majority
> approval, with at least 3 PMC affirmative votes.
>
>
>
> Thanks,
>
> Prashant Wason
>
>
>
> [1]
>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822&version=12352700
>
> [2] https://dist.apache.org/repos/dist/dev/hudi/hudi-0.14.0-rc1/
>
> [3] https://dist.apache.org/repos/dist/release/hudi/KEYS
>
> [4] https://repository.apache.org/content/repositories/orgapachehudi-1126/
>
> [5] https://github.com/apache/hudi/releases/tag/release-0.14.0-rc2
> 
>


Re: [DISCUSS] Multi-table transactions

2023-08-30 Thread Vinoth Chandar
+1 Reviewed the RFC. Looks like a promising direction to take.

On Thu, Aug 24, 2023 at 9:26 AM sagar sumit  wrote:

> Hi devs,
>
> RFC-69 proposes some exciting features and in line with that vision,
> I would like to propose support for multi-table transactions in Hudi.
>
> As the name suggests, this would enable transactional consistency
> across multiple tables, i.e. a set of changes to multiple tables either
> completely succeeds or completely fails. This could be helpful for use
> cases such as updating details about a sales order that affects 2 or more
> tables, deleting records for a customer across 2 or more tables, etc.
>
> Hudi already provides ACID guarantees on a single table and tunable
> concurrency control. We would need to build additional orchestration or
> consistency mechanisms on top of existing mechanisms. I would like to
> put more details in a separate RFC. However, the high-level goal is to
> provide the same guarantees as Hudi provides for a single table and
> should work with both kinds of concurrency control, OCC and MVCC.
>
> Looking forward to hearing some thoughts from you all.
>
> Regards,
> Sagar
>


Re: [DISCUSS] Release Manager for 1.0

2023-08-16 Thread Vinoth Chandar
Awesome! That was easy. Let's go!

On Wed, Aug 16, 2023 at 5:32 AM sagar sumit  wrote:

> Hi Vinoth,
>
> 1.0 seems to be packed with exciting features.
> I would be glad to volunteer as the release manager.
>
> Regards,
> Sagar
>
> On Wed, Aug 16, 2023 at 5:24 PM Vinoth Chandar  wrote:
>
> > Hi PMC/Committers,
> >
> > We are looking for a volunteer to act as release manager for the 1.0
> > release.
> > https://cwiki.apache.org/confluence/display/HUDI/1.0+Execution+Planning
> >
> > Anyone interested?
> >
> > Thanks
> > Vinoth
> >
>


[DISCUSS] Release Manager for 1.0

2023-08-16 Thread Vinoth Chandar
Hi PMC/Committers,

We are looking for a volunteer to act as release manager for the 1.0
release.
https://cwiki.apache.org/confluence/display/HUDI/1.0+Execution+Planning

Anyone interested?

Thanks
Vinoth


Re: DISCUSS Hudi 1.x plans

2023-08-16 Thread Vinoth Chandar
Hello everyone,

We have been doing a lot of foundational design and prototyping work, and I
have outlined an execution plan here:
https://cwiki.apache.org/confluence/display/HUDI/1.0+Execution+Planning

Look forward to contributions!


On Wed, May 10, 2023 at 4:14 PM Sivabalan  wrote:

> Great! Left some feedback.
>
> On Wed, 10 May 2023 at 06:56, Vinoth Chandar  wrote:
> >
> > All - the RFC is up here. Please comment on the PR or use the dev list to
> > discuss ideas.
> > https://github.com/apache/hudi/pull/8679/
> >
> > On Mon, May 8, 2023 at 11:43 PM Vinoth Chandar 
> wrote:
> >
> > > I have claimed RFC-69, per our process.
> > >
> > > On Mon, May 8, 2023 at 9:19 PM Vinoth Chandar 
> wrote:
> > >
> > >> Hi all,
> > >>
> > >> I have been consolidating all our progress on Hudi and putting
> together a
> > >> proposal for Hudi 1.x vision and a concrete plan for the first
> version 1.0.
> > >>
> > >> Will plan to open up the RFC to gather ideas across the community in
> > >> coming days.
> > >>
> > >> Thanks
> > >> Vinoth
> > >>
> > >
>
>
>
> --
> Regards,
> -Sivabalan
>


Re: [Feature Request] Support "faking" hudi commit time with the value of some field in the record

2023-08-16 Thread Vinoth Chandar
(sorry for the late reply)

Hi - the commit time can be a logical time as well; a lot of tests work
this way. There may be some table features (e.g. time-based cleaning) that
may not work, but those are more convenience ones anyway.

I assume the consumer would process all events at the required source
timestamp boundary to achieve this?

I am happy to chat/help scope the changes more.
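
To make the boundary idea concrete, here is a minimal pyspark sketch of
replaying a changelog one boundary at a time (the path, table name, and the
"op_ts" event-time field are hypothetical; this is a workaround sketch, not
a new Hudi API):

```
# Replay a persisted CDC changelog one day-boundary at a time, so each
# boundary becomes its own Hudi commit that queries can time-travel to.
changelog = spark.read.parquet("s3://bucket/changelog")  # hypothetical path

hudi_options = {
    "hoodie.table.name": "bootstrap_demo",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "op_ts",
    "hoodie.datasource.write.operation": "upsert",
}

# One write per boundary => one commit per boundary, replayed in source order.
days = [r[0] for r in changelog.selectExpr("to_date(op_ts) as d")
                               .distinct().orderBy("d").collect()]
for d in days:
    (changelog.filter(f"to_date(op_ts) = '{d}'")
        .write.format("hudi").options(**hudi_options)
        .mode("append").save("/tmp/bootstrap_demo"))

# Time-travel still keys off commit time, but each commit now lines up
# with a source-time boundary.
spark.read.format("hudi").option("as.of.instant", "20230102000000") \
    .load("/tmp/bootstrap_demo").show()
```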



On Wed, Aug 2, 2023 at 1:17 PM Joseph Thaidigsman
 wrote:

> Hello,
>
> We have a use-case where we have persisted the full CDC changelog for some
> tables in s3 and want to be able to bootstrap hudi tables with the
> changelog data and then be able to time-travel the hudi table to get
> snapshot views of the table on dates prior to bootstrapping. In our
> changelog, we have the timestamp associated with the
> inserts/updates/deletes, so the data to achieve this is present. If we had
> a live consumer processing those events in real-time and writing them to a
> hudi table, then we would be able to achieve this, but because we are
> instead creating the hudi table from a single batch job, we are unable to
> achieve it despite processing the same exact data, since time-travel is all
> based on the hudi commit time.
>
> Aside from our specific use-case for bootstrapping tables, this would be
> useful for real-time CDC consumers as well.  Currently, there is no way to
> guarantee the accuracy of the time-travel operation as it relates to
> reflecting the state of the upstream database table at a given point in
> time. For example, say you have some downstream batch pipelines that want
> to perform some aggregations based on production database tables at a fixed
> point each day. In the case of lag or outage on the consumer-side, when the
> consumer restarts, we have a large gap in hudi commit time and are unable
> to time-travel to the exact moment that the downstream pipelines expect to
> reflect the database table state.
>
> If the hudi writer instead supported picking some field from the CDC record
> as the value for the hudi commit time, then the consumer could process the
> events at any time and the time-travel functionality would be the same
> regardless of consumption time. This would make the writer idempotent in a
> way that it currently lacks, guaranteeing consistent results for downstream
> pipelines.
>
> Original Slack Thread:
> https://apache-hudi.slack.com/archives/C4D716NPQ/p1690583690053259
>


Re: [DISCUSS] Hudi Reverse Streamer

2023-08-16 Thread Vinoth Chandar
Hi Pratyaksh,

Are you still actively driving this?

On Tue, Jul 11, 2023 at 2:18 PM Pratyaksh Sharma 
wrote:

> Update: I will be raising the initial draft of RFC in the next couple of
> days.
>
> On Thu, Jun 15, 2023 at 2:28 AM Rajesh Mahindra 
> wrote:
>
> > Great. We also need it for use cases of loading data into warehouses, and
> > would love to help.
> >
> > On Wed, Jun 14, 2023 at 9:06 AM Pratyaksh Sharma 
> > wrote:
> >
> > > Hi,
> > >
> > > I missed this email earlier. Sure let me start an RFC this week and we
> > can
> > > take it from there.
> > >
> > > On Wed, Jun 14, 2023 at 9:20 PM Nicolas Paris <
> nicolas.pa...@riseup.net>
> > > wrote:
> > >
> > > > Hi any rfc/ongoing efforts on the reverse delta streamer ? We have a
> > use
> > > > case to do hudi => Kafka and would enjoy building a more general
> tool.
> > > >
> > > > However we need a rfc basis to start some effort in the right way
> > > >
> > > > On April 12, 2023 3:08:22 AM UTC, Vinoth Chandar <
> > > > mail.vinoth.chan...@gmail.com> wrote:
> > > > >Cool. lets draw up a RFC for this? @pratyaksh - do you want to start
> > > one,
> > > > >given you expressed interest?
> > > > >
> > > > >On Mon, Apr 10, 2023 at 7:32 PM Léo Biscassi <
> leo.bisca...@gmail.com>
> > > > wrote:
> > > > >
> > > > >> +1
> > > > >> This would be great!
> > > > >>
> > > > >> Cheers,
> > > > >>
> > > > >> On Mon, Apr 3, 2023 at 3:00 PM Pratyaksh Sharma <
> > > pratyaks...@gmail.com>
> > > > >> wrote:
> > > > >>
> > > > >> > Hi Vinoth,
> > > > >> >
> > > > >> > I am aligned with the first reason that you mentioned. Better to
> > > have
> > > > a
> > > > >> > separate tool to take care of this.
> > > > >> >
> > > > >> > On Mon, Apr 3, 2023 at 9:01 PM Vinoth Chandar <
> > > > >> > mail.vinoth.chan...@gmail.com>
> > > > >> > wrote:
> > > > >> >
> > > > >> > > +1
> > > > >> > >
> > > > >> > > I was thinking that we add a new utility and NOT extend
> > > > DeltaStreamer
> > > > >> by
> > > > >> > > adding a Sink interface, for the following reasons
> > > > >> > >
> > > > >> > > - It will make it look like a generic Source => Sink ETL tool,
> > > > which is
> > > > >> > > actually not our intention to support on Hudi. There are
> plenty
> > of
> > > > good
> > > > >> > > tools for that out there.
> > > > >> > > - the config management can get bit hard to understand, since
> we
> > > > >> overload
> > > > >> > > ingest and reverse ETL into a single tool. So break it off at
> > > > use-case
> > > > >> > > level?
> > > > >> > >
> > > > >> > > Thoughts?
> > > > >> > >
> > > > >> > > David:  PMC does not have control over that. Please see
> > > unsubscribe
> > > > >> > > instructions here.
> > https://hudi.apache.org/community/get-involved
> > > > >> > > Love to keep this thread about reverse streamer discussion. So
> > > > kindly
> > > > >> > fork
> > > > >> > > another thread if you want to discuss unsubscribing.
> > > > >> > >
> > > > >> > > On Fri, Mar 31, 2023 at 1:47 AM Davidiam <
> > david.rosa...@gmail.com
> > > >
> > > > >> > wrote:
> > > > >> > >
> > > > >> > > > Hello Vinoth,
> > > > >> > > >
> > > > >> > > > Can you please unsubscribe me?  I have been trying to
> > > unsubscribe
> > > > for
> > > > >> > > > months without success.
> > > > >> > > >
> > > > >> > > > Kind Regards,
> > > > >> > > > David
> > > > >> > > >
> > > > >> > > > Sent from Outlook for Android<https://aka.ms/AAb9ysg>
> > > > >> > > > 
> > > > >> > > > From: Vinoth Chandar 
> > > > >> > > > Sent: Friday, March 31, 2023 5:09:52 AM
> > > > >> > > > To: dev 
> > > > >> > > > Subject: [DISCUSS] Hudi Reverse Streamer
> > > > >> > > >
> > > > >> > > > Hi all,
> > > > >> > > >
> > > > >> > > > Any interest in building a reverse streaming tool, that does
> > the
> > > > >> > reverse
> > > > >> > > of
> > > > >> > > > what the DeltaStreamer tool does? It will read Hudi table
> > > > >> incrementally
> > > > >> > > > (only source) and write out the data to a variety of sinks -
> > > > Kafka,
> > > > >> > JDBC
> > > > >> > > > Databases, DFS.
> > > > >> > > >
> > > > >> > > > This has come up many times with data warehouse users. Often
> > > > times,
> > > > >> > they
> > > > >> > > > want to use Hudi to speed up or reduce costs on their data
> > > > ingestion
> > > > >> > and
> > > > >> > > > ETL (using Spark/Flink), but want to move the derived data
> > back
> > > > into
> > > > >> a
> > > > >> > > data
> > > > >> > > > warehouse or an operational database for serving.
> > > > >> > > >
> > > > >> > > > What do you all think?
> > > > >> > > >
> > > > >> > > > Thanks
> > > > >> > > > Vinoth
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > > >>
> > > > >> --
> > > > >> *Léo Biscassi*
> > > > >> Blog - https://leobiscassi.com
> > > > >>
> > > > >>-
> > > > >>
> > > >
> > >
> >
> >
> > --
> > Take Care,
> > Rajesh Mahindra
> >
>


Re: Record level index with not unique keys

2023-08-16 Thread Vinoth Chandar
Hi,

Yes, the indexing DAG can support this today, and even if not, it can be
easily fixed. The main issue would be how we encode the mapping well.
For example, if we want to map from user_id to all events that belong to
the user, we need a different, scalable way of storing this mapping.
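
To illustrate the shape of the problem, a toy Python sketch of a one-to-many
key index (a plain dict, not the actual MDT schema or Hudi code):

```
# A non-unique key index must map one record key to a *list* of
# (partition, fileId, position) locations instead of exactly one.
from collections import defaultdict

index = defaultdict(list)

def tag_location(key, partition, file_id, position):
    # Append rather than overwrite, so every copy of the key keeps its
    # own location.
    index[key].append((partition, file_id, position))

tag_location("user_1", "part=2", "f-abc", 0)
tag_location("user_1", "part=3", "f-def", 1)
print(index["user_1"])  # both locations come back for the same key
```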

I can organize this work under the 1.0 stream, if you are interested in
driving.

Thanks
Vinoth


On Thu, Jul 13, 2023 at 1:09 PM nicolas paris 
wrote:

> Hello Prashant, thanks for your time.
>
>
> > With non unique keys how would tagging of records (for updates /
> deletes) work?
>
> Currently both GLOBAL_SIMPLE/BLOOM work out of the box in the mentioned
> context. See the pyspark script and results below. As for the
> implementation, tagLocationBacktoRecords returns an RDD of
> HoodieRecord with (key/part/location), and it can contain duplicate
> keys (and then multiple records for the same key).
>
> ```
> tableName = "test_global_bloom"
> basePath = f"/tmp/{tableName}"
>
> hudi_options = {
> "hoodie.table.name": tableName,
> "hoodie.datasource.write.recordkey.field": "event_id",
> "hoodie.datasource.write.partitionpath.field": "part",
> "hoodie.datasource.write.table.name": tableName,
> "hoodie.datasource.write.precombine.field": "ts",
> "hoodie.datasource.write.hive_style_partitioning": "true",
> "hoodie.datasource.hive_sync.enable": "false",
> "hoodie.metadata.enable": "true",
> "hoodie.index.type": "GLOBAL_BLOOM", # GLOBAL_SIMPLE works as well
> }
>
> # LET'S GEN DUPLS
> mode="overwrite"
> df =spark.sql("""select '1' as event_id, '2' as ts, '2' as part UNION
>  select '1' as event_id, '3' as ts, '3' as part UNION
>  select '1' as event_id, '2' as ts, '3' as part UNION
>  select '2' as event_id, '2' as ts, '3' as part""")
> df.write.format("hudi").options(**hudi_options).option("hoodie.datasource.write.operation", "BULK_INSERT").mode(mode).save(basePath)
> spark.read.format("hudi").load(basePath).select("event_id",
> "ts","part").show()
> # ++---++
> # |event_id| ts|part|
> # ++---++
> # |   1|  3|   3|
> # |   1|  2|   3|
> # |   2|  2|   3|
> # |   1|  2|   2|
> # ++---++
>
> # UPDATE
> mode="append"
> spark.sql("select '1' as event_id, '20' as ts, '4' as
> part").write.format("hudi").options(**hudi_options).option("hoodie.data
> source.write.operation", "UPSERT").mode(mode).save(basePath)
> spark.read.format("hudi").load(basePath).select("event_id",
> "ts","part").show()
> # ++---++
> # |event_id| ts|part|
> # ++---++
> # |   1| 20|   4|
> # |   1| 20|   4|
> # |   1| 20|   4|
> # |   2|  2|   3|
> # ++---++
>
> # DELETE
> mode="append"
> spark.sql("select 1 as
> event_id").write.format("hudi").options(**hudi_options).option("hoodie.
> datasource.write.operation", "DELETE").mode(mode).save(basePath)
> spark.read.format("hudi").load(basePath).select("event_id",
> "ts","part").show()
> # ++---++
> # |event_id| ts|part|
> # ++---++
> # |   2|  2|   3|
> # ++---++
> ```
>
>
> > How would record Index know which mapping of the array to
> return for a given record key?
>
> Like GLOBAL_SIMPLE/BLOOM, for a given record key, the RLI would
> return a list of mappings. Then the operation (update, delete, FCOW ...)
> would apply to each location.
>
> To illustrate, we could get something like this in the MDT:
>
> |event_id:1|[
>  {part=2, -5811947225812876253, -6812062179961430298, 0,
> 1689147210233},
>  {part=3, -711947225812876253, -8812062179961430298, 1,
> 1689147210233},
>  {part=3, -1811947225812876253, -2812062179961430298, 0,
> 1689147210233}
>  ]|
>
>
> On Thu, 2023-07-13 at 10:17 -0700, Prashant Wason wrote:
> > Hi Nicolas,
> >
> > The RI feature is designed for max performance as it is at a
> > record-count scale. Hence, the schema is simplified and minimized.
> >
> > With non unique keys how would tagging of records (for updates /
> > deletes)
> > work? How would record Index know which mapping of the array to
> > return for
> > a given record key?
> >
> > Thanks
> > Prashant
> >
> >
> >
> > On Wed, Jul 12, 2023 at 2:02 AM nicolas paris
> > 
> > wrote:
> >
> > > hi there,
> > >
> > > Just tested preview of RLI (rfc-08), amazing feature. Soon the fast
> > > COW
> > > (rfc-68) will be based on RLI to get the parquet offsets and allow
> > > targeting parquet row groups.
> > >
> > > RLI is a global index, therefore it assumes the hudi key is present
> > > in
> > > at most one parquet file. As a result in the MDT, the RLI is of
> > > type
> > > struct, and there is a 1:1 mapping w/ a given file.
> > >
> > > Type:
> > >|-- recordIndexMetadata: struct (nullable = true)
> > >||-- partition: string (nullable = false)
> > >||-- fileIdHighBits: long (nullable = false)
> > >||-- fileIdLowBits: long (nullable = false)
> > >|

Re: [DISCUSS] Should we support a service to manage all deltastreamer jobs?

2023-08-16 Thread Vinoth Chandar
+1. There are RFCs on table management services, but none specific to
Deltastreamer itself.

Are you proposing building something specific to that?

On Wed, Jun 14, 2023 at 8:26 AM Pratyaksh Sharma 
wrote:

> Hi,
>
> Personally I am in favour of creating such a UI where monitoring and
> managing configurations is just a click away. That makes life a lot easier
> for users. So +1 on the proposal.
>
> I remember the work for it had started long back around 2019. You can check
> this RFC
> <
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=130027233
> >
> for your reference. I am not sure why this work could not continue though.
>
> On Wed, Jun 14, 2023 at 4:28 PM 孔维 <18701146...@163.com> wrote:
>
> > Hi, team,
> >
> >
> > Background:
> > More and more Hudi ingestion goes through Deltastreamer, resulting in a
> > large number of Deltastreamer jobs that need to be managed. In our
> > company, we also manage a large number of Deltastreamer jobs ourselves,
> > and there is a lot of operations, maintenance, and monitoring work.
> > If we can provide a Deltastreamer service to create, manage, and
> > monitor all jobs in a unified manner, it would greatly reduce the
> > management burden of Deltastreamer and at the same time lower the
> > threshold for using it, which would help its promotion and adoption.
> > At the same time, considering that Deltastreamer already supports hot
> > configuration updates [https://github.com/apache/hudi/pull/8807], we can
> > offer configuration changes without restarting the job, based on that
> > feature.
> >
> >
> > We hope to provide:
> > - A web UI to support the creation, management, and monitoring of
> > Deltastreamer jobs
> > - Timely configuration changes via the configuration hot-update
> > capability
> >
> >
> > I don't know whether such a service is in line with the direction of the
> > community, and I hope to hear your thoughts!
> >
> >
> > Best Regards
>


Re: About 0.14.0 Release Timeline

2023-06-21 Thread Vinoth Chandar
+1 from me.

On Wed, Jun 21, 2023 at 8:35 AM Prashant Wason 
wrote:

> Hello Everyone,
>
> I would like to start the discussion on the 0.14.0 release timeline. How
> about Jun 30 for feature freeze and July 15 for creating the release
> branch?
>
>
> Thanks
> Prashant Wason
> RM for 0.14.0
>


Re: [Action required] Default Spark profile changed to 3.2

2023-06-02 Thread Vinoth Chandar
Hi,

Just tried doing a "mvn clean install -DskipTests", and the build failed. My
local SPARK_HOME is pointing to a Spark 3.3 installation.
Does that matter now? Quite possibly this is an issue with my setup;
just flagging.

Thanks
Vinoth

On Fri, May 26, 2023 at 8:30 AM Shiyan Xu 
wrote:

> Hi all,
>
> We recently landed a change
> <
> https://github.com/apache/hudi/commit/516c3d59404934e6a142ea1c9d97002c065f8a4f
> >
> in master switching the default Spark profile from 2.4 to 3.2. If your
> local Hudi repo is configured to use Spark 2.4, you may need to re-import
> the IDEA project (this may involve clearing `.idea/` folder and *.iml files
> in each maven module)
>
> I'd also like to acknowledge contributions from Forward Xu, Rahil, Zhang
> Yue, and Danny, who have previously worked on or helped tackle this
> migration, as it involved a lot of dependency issue wrangling and test
> fixes.
>
> Cheers,
>
> --
> Best,
> Shiyan
>


Re: [DISCUSSION] Simplify code structure for supporting multiple Spark versions in Hudi

2023-06-02 Thread Vinoth Chandar
This is a good topic, thanks for raising it. Overall, our reliance on
Spark classes/APIs that are declared experimental is an issue on paper. But
there are few other ways to get the right performance without relying on
these. That has been the tricky issue IMO. Thoughts?

I'll review the code organization more carefully and report back.

On Fri, Jun 2, 2023 at 04:23 Rahil C  wrote:

> Thanks Shawn for writing this, I would like to also add on to the Spark
> Discussion.
>
> Currently I think our integration with Spark is too tight, and brings up
> serious issues when upgrading.
>
> I will describe one example (there are many more): one area is that we
> extend Spark's *ParquetFileFormat* in the following classes.
>
>
> https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieParquetFileFormat.scala
>
> https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark3.2plus-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/Spark32PlusHoodieParquetFileFormat.scala
>
> and specifically the main logic change is that we override the
> *buildReaderWithPartitionValues* method.
> I understand the pro of reusing Spark's code, but the con is that we
> don't then get the latest changes from the latest implementation of these
> methods. This gets more complex as we then need to understand which Spark
> changes are required to cherry-pick over as Spark upgrades, such as these
> issues.
>
> For spark 3.3.2 we faced several issues documented here
> https://github.com/apache/hudi/pull/8082,
> and for spark 3.4.0 we have encountered several issues as well.
> https://github.com/apache/hudi/pull/8682
>
> We are also not keeping up to date with certain Spark features as a result
> of the integration we have made. I have created a JIRA that goes into
> this in more depth:
> https://issues.apache.org/jira/browse/HUDI-6262
>
> Would be happy to sync with other hudi spark committers/experts, or anyone
> interested in revisiting this integration so that future spark work will be
> more achievable.
>
> Regards,
> Rahil Chertara
>
> On Tue, May 23, 2023 at 8:16 PM Shawn Chang  wrote:
>
> > Hi Hudi developers,
> >
> > I am writing to discuss the current code structure of the existing
> > hudi-spark-datasource and propose a more scalable approach for supporting
> > multiple Spark versions. The current structure involves common code
> shared
> > by several Spark versions, such as hudi-spark-common, hudi-spark3-common,
> > hudi-spark3.2plus-common, etc. (a detailed description can be found in
> the
> > readme here:
> >
> https://github.com/apache/hudi/blob/master/hudi-spark-datasource/README.md
> > ).
> > This setup aims to minimize duplicate code in Hudi. Hudi currently
> utilizes
> > the SparkAdapter to invoke specific code based on the Spark version,
> > allowing different Spark versions to trigger different logic.
> >
> > However, this code structure proves to be complex and hampers the process
> > of adding support for newer Spark versions. The current approach involves
> > the following steps:
> > 1) Identify breaking changes introduced by the new Spark version and
> patch
> > affected Hudi classes.
> > 2) Separate affected Hudi classes into different folders so that older
> > Spark versions can continue using the existing logic, while the new Spark
> > version can work with the updated Hudi classes.
> > 3) Connect SparkAdapter to these Hudi classes, enabling Hudi to utilize
> the
> > correct code based on the Spark version.
> > 4) Collect common code and place it in a new folder, such as
> > hudi-spark3.2plus-common, to reduce duplicate code.
> >
> > This convoluted process has significantly slowed down the pace of adding
> > support for newer Spark versions in Hudi. Fortunately, there is a simpler
> > alternative that can streamline the process. I propose removing the
> common
> > modules and having only one folder for each Spark version. For example:
> > hudi-spark-datasource/
> > ---hudi-spark2.4.0/
> > ---hudi-spark3.2.0/
> > ---hudi-spark3.3.0/
> > ...
> >
> > With this revised code structure, each Spark version will have its own
> > corresponding Hudi module. The process of adding Spark support will be
> > simplified as follows:
> > 1) Copy the latest existing hudi-spark module to a new module for the
> > new Spark version.
> > 2) Identify breaking changes introduced by the new Spark version and
> patch
> > affected Hudi classes.
> >
> > Let's consider some pros and cons of this new code structure:
> > *Pros:*
> > -A more readable codebase, with each Spark version having its individual
> > module.
> > -Easier addition of support for new Spark versions by duplicating the
> most
> > recent module and making necessary modifications.
> > -Simpler implementation of improvements specific to a particular Spark
> > version.
> > *Cons:*
> > -Increased duplicate code (though this shouldn't impact 

Re: [ANNOUNCE] Apache Hudi 0.13.1 released

2023-06-02 Thread Vinoth Chandar
Thanks for driving this!

On Wed, May 31, 2023 at 10:00 Yue Zhang  wrote:

> The Apache Hudi team is pleased to announce the release of Apache Hudi
> 0.13.1
>
> Apache Hudi (pronounced Hoodie) stands for Hadoop Upserts Deletes and
> Incrementals. Apache Hudi manages storage of large analytical datasets
> on DFS (Cloud stores, HDFS or any Hadoop FileSystem compatible
> storage) and provides the ability to query them.
>
> This release comes 3 months after 0.13.0 and 1 month after 0.12.3. This
> release is purely intended to fix stability issues and bugs, and includes
> more than 100 resolved issues. Fixes span many areas: core writer fixes,
> metadata, timeline, engine-specific fixes, table services, etc.
>
> For details on how to use Hudi, please look at the quick start page located
> at https://hudi.apache.org/docs/quick-start-guide.html
>
> If you'd like to download the source release, you can find it here:
> https://github.com/apache/hudi/releases/tag/release-0.13.1
>
> You can read more about the release (including release notes) here:
>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822&version=12352250
>
> We welcome your help and feedback. For more information on how to
> report problems, and to get involved, visit the project website at:
> http://hudi.apache.org/
>
> Thanks to everyone involved!


Re: DISCUSS Hudi 1.x plans

2023-05-10 Thread Vinoth Chandar
All - the RFC is up here. Please comment on the PR or use the dev list to
discuss ideas.
https://github.com/apache/hudi/pull/8679/

On Mon, May 8, 2023 at 11:43 PM Vinoth Chandar  wrote:

> I have claimed RFC-69, per our process.
>
> On Mon, May 8, 2023 at 9:19 PM Vinoth Chandar  wrote:
>
>> Hi all,
>>
>> I have been consolidating all our progress on Hudi and putting together a
>> proposal for Hudi 1.x vision and a concrete plan for the first version 1.0.
>>
>> Will plan to open up the RFC to gather ideas across the community in
>> coming days.
>>
>> Thanks
>> Vinoth
>>
>


Re: Calling for 0.13.1 Release

2023-05-09 Thread Vinoth Chandar
Looks like the PR landed.


On Thu, May 4, 2023 at 1:27 PM nicolas paris 
wrote:

> Hi, any timeline for the 0.13.1 bugfix release?
> May that one be added to the prep branch:
> https://github.com/apache/hudi/pull/8432
>
>
> On Thu, 2023-03-09 at 11:21 -0600, Shiyan Xu wrote:
> > thanks for volunteering! let's collab on the release work
> >
> > On Sun, Mar 5, 2023 at 8:16 PM Forward Xu 
> > wrote:
> >
> > > +1, Thanks for Yue Zhang to be the RM for the next 0.13.
> > > ForwardXu
> > >
> > > Yue Zhang  wrote on Fri, Mar 3, 2023 at 16:31:
> > >
> > > > Hi Hudiers,
> > > > I volunteer to be the RM for the next 0.13.1 if you don’t mind
> > > > :)
> > > >
> > > >
> > > > Yue Zhang
> > > > zhangyue921...@163.com
> > > >
> > > >
> > > > On 03/3/2023 16:23,Y Ethan Guo wrote:
> > > > Hi folks,
> > > >
> > > > Given that we have already found a few critical issues affecting
> > > > 0.13.0
> > > > release, such as the following, I suggest that we, as a
> > > > community, follow
> > > > up with 0.13.1 release in a month to address reliability issues
> > > > in
> > > 0.13.0.
> > > > Any volunteer for 0.13.1 Release Manager is welcome.
> > > >
> > > > https://github.com/apache/hudi/pull/8026
> > > > https://github.com/apache/hudi/pull/8079
> > > > https://github.com/apache/hudi/pull/8080
> > > >
> > > > Thanks,
> > > > - Ethan
> > > >
> > >
> >
> >
>
>


Re: DISCUSS Hudi 1.x plans

2023-05-08 Thread Vinoth Chandar
I have claimed RFC-69, per our process.

On Mon, May 8, 2023 at 9:19 PM Vinoth Chandar  wrote:

> Hi all,
>
> I have been consolidating all our progress on Hudi and putting together a
> proposal for Hudi 1.x vision and a concrete plan for the first version 1.0.
>
> Will plan to open up the RFC to gather ideas across the community in
> coming days.
>
> Thanks
> Vinoth
>


Re: Calling for 0.14.0 Release Manager

2023-05-08 Thread Vinoth Chandar
Great! Looking forward to a fantastic 0.14.

On Thu, May 4, 2023 at 2:07 PM Sivabalan  wrote:

> thanks!
>
> On Wed, 3 May 2023 at 13:40, Prashant Wason 
> wrote:
> >
> > I volunteer to drive the 0.14.0.
> >
> > Thanks
> > Prashant
> >
> >
> > On Wed, May 3, 2023 at 1:28 PM Sivabalan  wrote:
> >
> > > It's been few months since we released 0.13.0. It's time to start
> > > preparing for the next major release. Can we have a volunteers to
> > > drive the 0.14.0 release.
> > >
> > > --
> > > Regards,
> > > -Sivabalan
> > >
>
>
>
> --
> Regards,
> -Sivabalan
>


DISCUSS Hudi 1.x plans

2023-05-08 Thread Vinoth Chandar
Hi all,

I have been consolidating all our progress on Hudi and putting together a
proposal for Hudi 1.x vision and a concrete plan for the first version 1.0.

Will plan to open up the RFC to gather ideas across the community in coming
days.

Thanks
Vinoth


Re: [DISCUSS] Hudi Reverse Streamer

2023-04-11 Thread Vinoth Chandar
Cool. Let's draw up an RFC for this? @pratyaksh - do you want to start one,
given you expressed interest?

On Mon, Apr 10, 2023 at 7:32 PM Léo Biscassi  wrote:

> +1
> This would be great!
>
> Cheers,
>
> On Mon, Apr 3, 2023 at 3:00 PM Pratyaksh Sharma 
> wrote:
>
> > Hi Vinoth,
> >
> > I am aligned with the first reason that you mentioned. Better to have a
> > separate tool to take care of this.
> >
> > On Mon, Apr 3, 2023 at 9:01 PM Vinoth Chandar <
> > mail.vinoth.chan...@gmail.com>
> > wrote:
> >
> > > +1
> > >
> > > I was thinking that we add a new utility and NOT extend DeltaStreamer
> by
> > > adding a Sink interface, for the following reasons
> > >
> > > - It will make it look like a generic Source => Sink ETL tool, which is
> > > actually not our intention to support on Hudi. There are plenty of good
> > > tools for that out there.
> > > - the config management can get bit hard to understand, since we
> overload
> > > ingest and reverse ETL into a single tool. So break it off at use-case
> > > level?
> > >
> > > Thoughts?
> > >
> > > David:  PMC does not have control over that. Please see unsubscribe
> > > instructions here. https://hudi.apache.org/community/get-involved
> > > Love to keep this thread about reverse streamer discussion. So kindly
> > fork
> > > another thread if you want to discuss unsubscribing.
> > >
> > > On Fri, Mar 31, 2023 at 1:47 AM Davidiam 
> > wrote:
> > >
> > > > Hello Vinoth,
> > > >
> > > > Can you please unsubscribe me?  I have been trying to unsubscribe for
> > > > months without success.
> > > >
> > > > Kind Regards,
> > > > David
> > > >
> > > > Sent from Outlook for Android<https://aka.ms/AAb9ysg>
> > > > 
> > > > From: Vinoth Chandar 
> > > > Sent: Friday, March 31, 2023 5:09:52 AM
> > > > To: dev 
> > > > Subject: [DISCUSS] Hudi Reverse Streamer
> > > >
> > > > Hi all,
> > > >
> > > > Any interest in building a reverse streaming tool, that does the
> > reverse
> > > of
> > > > what the DeltaStreamer tool does? It will read Hudi table
> incrementally
> > > > (only source) and write out the data to a variety of sinks - Kafka,
> > JDBC
> > > > Databases, DFS.
> > > >
> > > > This has come up many times with data warehouse users. Often times,
> > they
> > > > want to use Hudi to speed up or reduce costs on their data ingestion
> > and
> > > > ETL (using Spark/Flink), but want to move the derived data back into
> a
> > > data
> > > > warehouse or an operational database for serving.
> > > >
> > > > What do you all think?
> > > >
> > > > Thanks
> > > > Vinoth
> > > >
> > >
> >
>
>
> --
> *Léo Biscassi*
> Blog - https://leobiscassi.com
>
>-
>


Re: Re: Re: [DISCUSS] split source of kafka partition by count

2023-04-07 Thread Vinoth Chandar
Pulled in another reviewer as well, and left a comment. Can we move the
discussion to the PR?

Thanks for the useful contribution!
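
For anyone skimming the thread: the core idea is to chunk each Kafka
partition's offset range by a per-batch event cap, so one hot partition can
back several Spark input partitions. A minimal sketch of that chunking
(illustrative only, not the PR's actual code):

```
# Split one Kafka partition's [start, end) offset range into chunks of
# at most max_events offsets; each chunk can back its own Spark
# input partition.
def split_offset_range(start, end, max_events):
    chunks = []
    while start < end:
        chunks.append((start, min(start + max_events, end)))
        start += max_events
    return chunks

# A skewed partition holding 1,000,000 events with a 200,000 cap yields
# five Spark input partitions instead of one.
print(split_offset_range(0, 1_000_000, 200_000))
# [(0, 200000), (200000, 400000), (400000, 600000), (600000, 800000),
#  (800000, 1000000)]
```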

On Thu, Apr 6, 2023 at 12:34 AM 孔维 <18701146...@163.com> wrote:

> Hi, vinoth,
>
> I created a PR(https://github.com/apache/hudi/pull/8376) for this
> feature, could you help review it?
>
>
> BR,
> Kong
>
>
>
>
> At 2023-04-05 00:19:20, "Vinoth Chandar"  wrote:
> >Look forward to this! could really help backfill/rebootstrap scenarios.
> >
> >On Tue, Apr 4, 2023 at 9:18 AM Vinoth Chandar  wrote:
> >
> >> Thinking out loud.
> >>
> >> 1. For insert operations, it should not matter anyway.
> >> 2. For upsert etc, the preCombine would handle the ordering problems.
> >>
> >> Is that what you are saying? I feel we don't want to leak any Kafka
> >> specific logic or force use of special payloads etc. thoughts?
> >>
> >> I assigned the jira to you and also made you a contributor. So in future,
> >> you can self-assign.
> >>
> >> On Mon, Apr 3, 2023 at 7:08 PM 孔维 <18701146...@163.com> wrote:
> >>
> >>> Hi,
> >>>
> >>>
> >>> Yea, we can create multiple spark input partitions per Kafka partition.
> >>>
> >>>
> >>> I think the write operations can handle the potentially out-of-order
> >>> events, because before writing we need to preCombine the incoming events
> >>> using source-ordering-field and we also need to combineAndGetUpdateValue
> >>> with records on storage. From a business perspective, we use the combine
> >>> logic to keep our data correct. And hudi does not require any guarantees
> >>> about the ordering of kafka events.
> >>>
> >>>
> >>> I already filed one JIRA[https://issues.apache.org/jira/browse/HUDI-6019],
> >>> could you help assign the JIRA to me?
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> At 2023-04-03 23:27:13, "Vinoth Chandar"  wrote:
> >>> >Hi,
> >>> >
> >>> >Does your implementation read out offset ranges from Kafka partitions?
> >>> >which means - we can create multiple spark input partitions per Kafka
> >>> >partitions?
> >>> >if so, +1 for overall goals here.
> >>> >
> >>> >How does this affect ordering? Can you think about how/if Hudi write
> >>> >operations can handle potentially out-of-order events being read out?
> >>> >It feels like we can add a JIRA for this anyway.
> >>> >
> >>> >
> >>> >
> >>> >On Thu, Mar 30, 2023 at 10:02 PM 孔维 <18701146...@163.com> wrote:
> >>> >
> >>> >> Hi team, for the kafka source, when pulling data from kafka, the
> >>> default
> >>> >> parallelism is the number of kafka partitions.
> >>> >> There are cases:
> >>> >>
> >>> >> Pulling large amount of data from kafka (eg. maxEvents=1), but
> >>> the
> >>> >> # of kafka partition is not enough, the procedure of the pulling will
> >>> cost
> >>> >> too much time, or even worse, cause executor OOM
> >>> >> There is huge data skew between kafka partitions, the procedure of the
> >>> >> pulling will be blocked by the slowest partition
> >>> >>
> >>> >> to solve those cases, I want to add a parameter
> >>> >> hoodie.deltastreamer.kafka.per.batch.maxEvents to control the
> >>> maxEvents in
> >>> >> one kafka batch, default Long.MAX_VALUE means not turning this feature on.
> >>> >> hoodie.deltastreamer.kafka.per.batch.maxEvents  this configuration will
> >>> >> take effect after the hoodie.deltastreamer.kafka.source.maxEvents
> >>> config.
> >>> >>
> >>> >>
> >>> >> Here is my POC of the improvement:
> >>> >> max executor core is 128.
> >>> >> not turn the feature on
> >>> >> (hoodie.deltastreamer.kafka.source.maxEvents=5000)
> >>> >>
> >>> >>
> >>> >> turn on the feature
> >>> (hoodie.deltastreamer.kafka.per.batch.maxEvents=20)
> >>> >>
> >>> >>
> >>> >> after turning on the feature, the timing of Tagging reduced from 4.4
> >>> >> mins to 1.1 mins, and can be even faster if given more cores.
> >>> >>
> >>> >> What do you think? Can I file a JIRA issue for this?
> >>>
> >>
>
>


Re: Re: [DISCUSS] split source of kafka partition by count

2023-04-04 Thread Vinoth Chandar
Look forward to this! Could really help backfill/rebootstrap scenarios.

On Tue, Apr 4, 2023 at 9:18 AM Vinoth Chandar  wrote:

> Thinking out loud.
>
> 1. For insert operations, it should not matter anyway.
> 2. For upsert etc, the preCombine would handle the ordering problems.
>
> Is that what you are saying? I feel we don't want to leak any Kafka
> specific logic or force use of special payloads etc. thoughts?
>
> I assigned the jira to you and also made you a contributor. So in future,
> you can self-assign.
>
> On Mon, Apr 3, 2023 at 7:08 PM 孔维 <18701146...@163.com> wrote:
>
>> Hi,
>>
>>
>> Yea, we can create multiple spark input partitions per Kafka partition.
>>
>>
>> I think the write operations can handle the potentially out-of-order
>> events, because before writing we need to preCombine the incoming events
>> using source-ordering-field and we also need to combineAndGetUpdateValue
>> with records on storage. From a business perspective, we use the combine
>> logic to keep our data correct. And hudi does not require any guarantees
>> about the ordering of kafka events.
>>
>>
>> I already filed one JIRA[https://issues.apache.org/jira/browse/HUDI-6019],
>> could you help assign the JIRA to me?
>>
>>
>>
>>
>>
>>
>>
>> At 2023-04-03 23:27:13, "Vinoth Chandar"  wrote:
>> >Hi,
>> >
>> >Does your implementation read out offset ranges from Kafka partitions?
>> >which means - we can create multiple spark input partitions per Kafka
>> >partitions?
>> >if so, +1 for overall goals here.
>> >
>> >How does this affect ordering? Can you think about how/if Hudi write
>> >operations can handle potentially out-of-order events being read out?
>> >It feels like we can add a JIRA for this anyway.
>> >
>> >
>> >
>> >On Thu, Mar 30, 2023 at 10:02 PM 孔维 <18701146...@163.com> wrote:
>> >
>> >> Hi team, for the kafka source, when pulling data from kafka, the
>> default
>> >> parallelism is the number of kafka partitions.
>> >> There are cases:
>> >>
>> >> Pulling large amount of data from kafka (eg. maxEvents=1), but
>> the
>> >> # of kafka partition is not enough, the procedure of the pulling will
>> cost
>> >> too much time, or even worse, cause executor OOM
>> >> There is huge data skew between kafka partitions, the procedure of the
>> >> pulling will be blocked by the slowest partition
>> >>
>> >> to solve those cases, I want to add a parameter
>> >> hoodie.deltastreamer.kafka.per.batch.maxEvents to control the
>> maxEvents in
>> >> one kafka batch, default Long.MAX_VALUE means not turning this feature on.
>> >> hoodie.deltastreamer.kafka.per.batch.maxEvents  this configuration will
>> >> take effect after the hoodie.deltastreamer.kafka.source.maxEvents
>> config.
>> >>
>> >>
>> >> Here is my POC of the improvement:
>> >> max executor core is 128.
>> >> not turn the feature on
>> >> (hoodie.deltastreamer.kafka.source.maxEvents=5000)
>> >>
>> >>
>> >> turn on the feature
>> (hoodie.deltastreamer.kafka.per.batch.maxEvents=20)
>> >>
>> >>
>> >> after turning on the feature, the timing of Tagging reduced from 4.4
>> >> mins to 1.1 mins, and can be even faster if given more cores.
>> >>
>> >> What do you think? Can I file a JIRA issue for this?
>>
>


Re: Re: [DISCUSS] split source of kafka partition by count

2023-04-04 Thread Vinoth Chandar
Thinking out loud.

1. For insert operations, it should not matter anyway.
2. For upsert etc., the preCombine would handle the ordering problems (see
the sketch below).

Is that what you are saying? I feel we don't want to leak any Kafka-specific
logic or force the use of special payloads, etc. Thoughts?

I assigned the jira to you and also made you a contributor. So in future,
you can self-assign.
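
As a quick illustration of point 2, a pyspark sketch of the precombine field
resolving out-of-order duplicates within a batch (table name, path, and
fields are illustrative only):

```
# Two versions of the same key arrive out of order; with the precombine
# field set to "ts", the upsert keeps the row with the larger ts
# regardless of arrival order within the batch.
batch = spark.createDataFrame(
    [("k1", 5, "new"), ("k1", 3, "old")], ["id", "ts", "val"])

(batch.write.format("hudi")
    .option("hoodie.table.name", "precombine_demo")
    .option("hoodie.datasource.write.recordkey.field", "id")
    .option("hoodie.datasource.write.precombine.field", "ts")
    .option("hoodie.datasource.write.operation", "upsert")
    .mode("append").save("/tmp/precombine_demo"))

# Reading back shows only ("k1", 5, "new").
spark.read.format("hudi").load("/tmp/precombine_demo") \
    .select("id", "ts", "val").show()
```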

On Mon, Apr 3, 2023 at 7:08 PM 孔维 <18701146...@163.com> wrote:

> Hi,
>
>
> Yea, we can create multiple spark input partitions per Kafka partition.
>
>
> I think the write operations can handle the potentially out-of-order
> events, because before writing we need to preCombine the incoming events
> using source-ordering-field and we also need to combineAndGetUpdateValue
> with records on storage. From a business perspective, we use the combine
> logic to keep our data correct. And hudi does not require any guarantees
> about the ordering of kafka events.
>
>
> I already filed one JIRA[https://issues.apache.org/jira/browse/HUDI-6019],
> could you help assign the JIRA to me?
>
>
>
>
>
>
>
> At 2023-04-03 23:27:13, "Vinoth Chandar"  wrote:
> >Hi,
> >
> >Does your implementation read out offset ranges from Kafka partitions?
> >which means - we can create multiple spark input partitions per Kafka
> >partitions?
> >if so, +1 for overall goals here.
> >
> >How does this affect ordering? Can you think about how/if Hudi write
> >operations can handle potentially out-of-order events being read out?
> >It feels like we can add a JIRA for this anyway.
> >
> >
> >
> >On Thu, Mar 30, 2023 at 10:02 PM 孔维 <18701146...@163.com> wrote:
> >
> >> Hi team, for the kafka source, when pulling data from kafka, the default
> >> parallelism is the number of kafka partitions.
> >> There are cases:
> >>
> >> Pulling large amount of data from kafka (eg. maxEvents=1), but
> the
> >> # of kafka partition is not enough, the procedure of the pulling will
> cost
> >> too much time, or even worse, cause executor OOM
> >> There is huge data skew between kafka partitions, the procedure of the
> >> pulling will be blocked by the slowest partition
> >>
> >> to solve those cases, I want to add a parameter
> >> hoodie.deltastreamer.kafka.per.batch.maxEvents to control the maxEvents
> in
> >> one kafka batch, default Long.MAX_VALUE means not turning this feature on.
> >> hoodie.deltastreamer.kafka.per.batch.maxEvents  this configuration will
> >> take effect after the hoodie.deltastreamer.kafka.source.maxEvents
> config.
> >>
> >>
> >> Here is my POC of the improvement:
> >> max executor core is 128.
> >> not turn the feature on
> >> (hoodie.deltastreamer.kafka.source.maxEvents=5000)
> >>
> >>
> >> turn on the feature
> (hoodie.deltastreamer.kafka.per.batch.maxEvents=20)
> >>
> >>
> >> after turning on the feature, the timing of Tagging reduced from 4.4 mins
> >> to 1.1 mins, and can be even faster if given more cores.
> >>
> >> What do you think? Can I file a JIRA issue for this?
>


Re: What precombine field really is used for and its future?

2023-04-04 Thread Vinoth Chandar
This current thread is another example of a practical need for the
preCombine field:
 "[DISCUSS] split source of kafka partition by count"


On Tue, Apr 4, 2023 at 7:31 AM Vinoth Chandar  wrote:

> Thanks for raising this issue.
>
> Love to use this opp to share more context on why the preCombine field
> exists.
>
>- As you probably inferred already, we needed to eliminate duplicates,
>while dealing with out-of-order data (e.g. database change records arriving
>in different orders from two Kafka clusters in two zones). So it was
>necessary to preCombine by an "event" field, rather than just the arrival
>time (which is what _hoodie_commit_time is).
>- This comes from stream processing concepts like
>https://www.oreilly.com/radar/the-world-beyond-batch-streaming-101/ ,
>which build upon inadequacies in traditional database systems to deal with
>things like this. At the end of the day, we are solving a "processing"
>problem IMO with Hudi - Hudi replaces existing batch/streaming pipelines,
>not OLTP databases. That's at least the lens we approached it from.
>- For this to work end-end, it is not sufficient to just precombine
>within a batch of incoming writes, we also need to consistently apply the
>same against data in storage. In CoW, we implicitly merge against storage,
>so it's simpler. But for MoR, we simply append records to log files, so we
>needed to make this a table property - such that queries/compaction can
>later do the right preCombine. Hope that clarifies the CoW vs MoR
>differences.
>
> On the issues raised/proposals here.
>
>1. I think we need some dedicated efforts across the different writer
>paths to make it easier. probably some lower hanging fruits here. Some of
>it results from just different authors contributing to different code paths
>in an OSS project.
>2. On picking a sane default precombine field. _hoodie_commit_time is
>a good candidate for the preCombine field; as you point out, we would just
>pick one of many records with the same key arbitrarily in that scenario. On
>storage/across commits, we would pick the value with the latest
>commit_time/last writer wins - which would make queries repeatedly provide
>the same consistent values as well.  Needs more thought.
>3. If the user desires to customize this behavior, they could supply a
>preCombine field that is different. This would be similar to semantics of
>event time vs arrival order processing in streaming systems. Personally, I
>need to spend a bit more time digging to come up with an elegant solution
>here.
>4. For the proposals on how Hudi could de-duplicate, after the fact
>that inserts introduced duplicates - I think the current behavior is a bit
>more condoning than what I'd like tbh. It updates both the records IIRC. I
>think Hudi should ensure record key uniqueness across different paths and
>fail the write if it's violated - if we look at this through an RDBMS lens,
>that's what would happen, correct?
>
>
> Love to hear your thoughts. If we can file a JIRA or compile JIRAs with
> issues around this, we could discuss our short- and long-term plans?
>
> Thanks
> Vinoth
>
> On Sat, Apr 1, 2023 at 3:13 PM Ken Krugler 
> wrote:
>
>> Hi Daniel,
>>
>> Thanks for the detailed write-up.
>>
>> I can’t add much to the discussion, other than noting we also recently
>> ran into the related oddity that we don’t need to define a precombine when
>> writing data to a COW table (using Flink), but then trying to use Spark to
>> drop partitions failed because there’s a default precombine field name (set
>> to “ts”), and if that field doesn’t exist then the Spark job fails.
>>
>> — Ken
>>
>>
>> > On Mar 31, 2023, at 1:20 PM, Daniel Kaźmirski 
>> wrote:
>> >
>> > Hi all,
>> >
>> > I would like to bring up the topic of how precombine field is used and
>> > what's the purpose of it. I would also like to know what are the plans
>> for
>> > it in the future.
>> >
> > At first glance the precombine field looks like it's only used to
>> deduplicate
>> > records in incoming batch.
> > But when digging deeper it looks like it can also be used to:
> > 1. combine records not before, but on write, to decide whether to update
> > an existing record (e.g. with DefaultHoodieRecordPayload)
>> > 2. combine records on read for MoR table to combine log and base files
>> > correctly.
>> > 3. precombine field is required for spark SQL UPDATE, even if user can't
>> > provide duplicates anyways 

Re: What precombine field really is used for and its future?

2023-04-04 Thread Vinoth Chandar
Thanks for raising this issue.

Love to use this opp to share more context on why the preCombine field
exists.

   - As you probably inferred already, we needed to eliminate duplicates,
    while dealing with out-of-order data (e.g. database change records arriving
   in different orders from two Kafka clusters in two zones). So it was
    necessary to preCombine by an "event" field, rather than just the arrival
   time (which is what _hoodie_commit_time is).
   - This comes from stream processing concepts like
   https://www.oreilly.com/radar/the-world-beyond-batch-streaming-101/ ,
   which build upon inadequacies in traditional database systems to deal with
   things like this. At the end of the day, we are solving a "processing"
   problem IMO with Hudi - Hudi replaces existing batch/streaming pipelines,
    not OLTP databases. That's at least the lens we approached it from.
   - For this to work end-to-end, it is not sufficient to just precombine
   within a batch of incoming writes; we also need to consistently apply the
   same against data in storage. In CoW, we implicitly merge against storage,
   so it's simpler. But for MoR, we simply append records to log files, so we
   needed to make this a table property - such that queries/compaction can
   later do the right preCombine. Hope that clarifies the CoW vs MoR
   differences (see the sketch below).
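
To make the ordering semantics concrete, here is a minimal sketch of a Spark
upsert with a preCombine field. The table/field names and path are made up;
the options are the standard datasource write configs:

import org.apache.spark.sql.*;
import java.util.Arrays;

SparkSession spark = SparkSession.builder().getOrCreate();
// Two versions of the same key, arriving out of order.
Dataset<Row> batch = spark.createDataFrame(
    Arrays.asList(
        RowFactory.create("k1", 10L, "late-arriving but newer"),
        RowFactory.create("k1", 5L, "early-arriving but older")),
    new StructType()
        .add("key", "string").add("event_ts", "long").add("val", "string"));

batch.write().format("hudi")
    .option("hoodie.table.name", "demo")
    .option("hoodie.datasource.write.recordkey.field", "key")
    .option("hoodie.datasource.write.precombine.field", "event_ts")
    .option("hoodie.datasource.write.operation", "upsert")
    .mode(SaveMode.Append)
    .save("/tmp/hudi/demo");
// The event_ts=10 row wins the preCombine regardless of arrival order; if we
// defaulted to _hoodie_commit_time instead, it would be "last writer wins".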

On the issues raised/proposals here.

   1. I think we need some dedicated effort across the different writer
   paths to make this easier. Probably some low-hanging fruit here. Some of
   it results from different authors contributing to different code paths
   in an OSS project.
   2. On picking a sane default precombine field: _hoodie_commit_time is a
   good candidate for the preCombine field; as you point out, we would just
   pick one of many records with the same key arbitrarily in that scenario. On
   storage/across commits, we would pick the value with the latest
   commit_time/last writer wins - which would make queries repeatedly provide
   the same consistent values as well. Needs more thought.
   3. If the user desires to customize this behavior, they could supply a
   different preCombine field. This would be similar to the semantics of
   event-time vs arrival-order processing in streaming systems. Personally, I
   need to spend a bit more time digging to come up with an elegant solution
   here.
   4. For the proposals on how Hudi could de-duplicate after the fact, when
   inserts have introduced duplicates - I think the current behavior is a bit
   more lenient than I'd like, tbh. It updates both records, IIRC. I think
   Hudi should ensure record key uniqueness across the different paths and
   fail the write if it's violated - if we look at this through an RDBMS lens,
   that's what would happen, correct? (See the sketch below.)
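
For reference, Spark SQL already has a knob in this direction; if I recall the
config correctly (worth double-checking), strict insert mode fails an INSERT
that would duplicate an existing record key on a pk table:

spark.sql("set hoodie.sql.insert.mode = strict");
spark.sql("insert into demo values ('k1', 5, 'dupe')"); // should throw if 'k1' exists

Making that kind of enforcement uniform across the write paths is what I am
suggesting above.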


Love to hear your thoughts. If we can file a JIRA or compile JIRAs with
issues around this, we could discuss our short- and long-term plans?

Thanks
Vinoth

On Sat, Apr 1, 2023 at 3:13 PM Ken Krugler 
wrote:

> Hi Daniel,
>
> Thanks for the detailed write-up.
>
> I can’t add much to the discussion, other than noting we also recently ran
> into the related oddity that we don’t need to define a precombine when
> writing data to a COW table (using Flink), but then trying to use Spark to
> drop partitions failed because there’s a default precombine field name (set
> to “ts”), and if that field doesn’t exist then the Spark job fails.
>
> — Ken
>
>
> > On Mar 31, 2023, at 1:20 PM, Daniel Kaźmirski 
> wrote:
> >
> > Hi all,
> >
> > I would like to bring up the topic of how precombine field is used and
> > what's the purpose of it. I would also like to know what are the plans
> for
> > it in the future.
> >
> > At first glance the precombine field looks like it's only used to deduplicate
> > records in the incoming batch.
> > But when digging deeper it looks like it can also be used to:
> > 1. combine records not before, but on write, to decide whether to update an
> > existing record (e.g. with DefaultHoodieRecordPayload)
> > 2. combine records on read for a MoR table to combine log and base files
> > correctly.
> > 3. the precombine field is required for Spark SQL UPDATE, even if the user
> > can't provide duplicates anyway with this sql statement.
> >
> > Regarding [3] there's an inconsistency, as the precombine field is not
> > required in MERGE INTO UPDATE. Underneath, UPSERT is switched to INSERT in
> > upsert mode to update existing records.
> >
> > I know that Hudi does a lot of work to ensure PK uniqueness across/within
> > partitions, and there is a need to deduplicate records before write or to
> > deduplicate existing data if duplicates were introduced, e.g. when using
> > non-strict insert mode.
> >
> > What should then happen in a situation where the user does not want to or
> > cannot provide a pre-combine field? Then it's on the user not to introduce
> > duplicates, but it makes Hudi more generic and easier to use for "SQL"
> > people.
> >
> > Having no precombine is already possible for CoW, but UPSERT and SQL 

Re: [DISCUSS] Hudi Reverse Streamer

2023-04-03 Thread Vinoth Chandar
+1

I was thinking that we add a new utility and NOT extend DeltaStreamer by
adding a Sink interface, for the following reasons:

- It will make it look like a generic Source => Sink ETL tool, which is
actually not what we intend to support in Hudi. There are plenty of good
tools for that out there.
- The config management can get a bit hard to understand, since we would
overload ingest and reverse ETL into a single tool. So break it off at
use-case level?

Thoughts?

David:  PMC does not have control over that. Please see unsubscribe
instructions here. https://hudi.apache.org/community/get-involved
Love to keep this thread about reverse streamer discussion. So kindly fork
another thread if you want to discuss unsubscribing.

On Fri, Mar 31, 2023 at 1:47 AM Davidiam  wrote:

> Hello Vinoth,
>
> Can you please unsubscribe me?  I have been trying to unsubscribe for
> months without success.
>
> Kind Regards,
> David
>
> Sent from Outlook for Android<https://aka.ms/AAb9ysg>
> ________
> From: Vinoth Chandar 
> Sent: Friday, March 31, 2023 5:09:52 AM
> To: dev 
> Subject: [DISCUSS] Hudi Reverse Streamer
>
> Hi all,
>
> Any interest in building a reverse streaming tool, that does the reverse of
> what the DeltaStreamer tool does? It will read Hudi table incrementally
> (only source) and write out the data to a variety of sinks - Kafka, JDBC
> Databases, DFS.
>
> This has come up many times with data warehouse users. Often times, they
> want to use Hudi to speed up or reduce costs on their data ingestion and
> ETL (using Spark/Flink), but want to move the derived data back into a data
> warehouse or an operational database for serving.
>
> What do you all think?
>
> Thanks
> Vinoth
>


Re: [DISCUSS] split source of kafka partition by count

2023-04-03 Thread Vinoth Chandar
Hi,

Does your implementation read out offset ranges from Kafka partitions,
meaning we can create multiple Spark input partitions per Kafka partition?
If so, +1 on the overall goals here.

How does this affect ordering? Can you think about how/if Hudi write
operations can handle potentially out-of-order events being read out?
It feels like we can add a JIRA for this anyway.
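
For illustration, the kind of splitting I imagine (a sketch; names are made
up, not from your PR): break each Kafka partition's [from, to) offset range
into chunks, so one Kafka partition can back several Spark input partitions.

import java.util.ArrayList;
import java.util.List;

static List<long[]> splitOffsetRange(long from, long to, long maxEventsPerBatch) {
  List<long[]> ranges = new ArrayList<>();
  for (long start = from; start < to; start += maxEventsPerBatch) {
    ranges.add(new long[] {start, Math.min(start + maxEventsPerBatch, to)});
  }
  return ranges;
}
// e.g. splitOffsetRange(0, 250, 100) -> [0,100), [100,200), [200,250)
// Note: rows for the same key may then land in different input partitions,
// so ordering has to be handled by preCombine, per the question above.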



On Thu, Mar 30, 2023 at 10:02 PM 孔维 <18701146...@163.com> wrote:

> Hi team, for the Kafka source, when pulling data from Kafka, the default
> parallelism is the number of Kafka partitions.
> There are cases where this hurts:
>
> Pulling a large amount of data from Kafka (eg. maxEvents=1), where the
> # of Kafka partitions is not enough: the pull will cost
> too much time, or even worse cause executor OOMs
> There is huge data skew between Kafka partitions: the pull
> will be blocked by the slowest partition
>
> To solve those cases, I want to add a parameter,
> hoodie.deltastreamer.kafka.per.batch.maxEvents, to control the max events in
> one Kafka batch; the default of Long.MAX_VALUE means the feature is off.
> This configuration will
> take effect after the hoodie.deltastreamer.kafka.source.maxEvents config.
>
>
> Here is my POC of the improvement (max executor cores is 128):
> with the feature off
> (hoodie.deltastreamer.kafka.source.maxEvents=5000)
> [timing screenshot omitted]
>
> with the feature on (hoodie.deltastreamer.kafka.per.batch.maxEvents=20)
> [timing screenshot omitted]
>
> After turning the feature on, the Tagging time drops from 4.4 mins to
> 1.1 mins, and could be even faster given more cores.
>
> What do you think? Can I file a JIRA issue for this?


Re: Re: Re: DISCUSS

2023-03-30 Thread Vinoth Chandar
I think we can first focus on validating the hash index + bloom filter vs
the consistent hash index. Have you looked at RFC-08, which is a
kind of hash index as well, except it stores the key => file group mapping
externally?

On Fri, Mar 24, 2023 at 2:14 AM 吕虎  wrote:

> Hi Vinoth, I am very happy to receive your reply. Here are some of my
> thoughts.
>
> At 2023-03-21 23:32:44, "Vinoth Chandar"  wrote:
> >>but when it is used for data expansion, it still involves the need to
> >redistribute the data records of some data files, thus affecting the
> >performance.
> >but expansion of the consistent hash index is an optional operation right?
>
> >Sorry, not still fully understanding the differences here,
> I'm sorry I didn't make myself clear. The expansion I mentioned last
> time refers to the growth of data records in a Hudi table.
> The difference between the consistent hash index and hash partitioning with
> a Bloom filter index is how they deal with data growth:
> The consistent hash index deals with it by splitting files.
> Splitting files affects performance, but keeps working effectively forever.
> So the consistent hash index is suitable for scenarios where data growth
> cannot be estimated, or the data will grow very large.
> Hash partitioning with a Bloom filter index deals with it by creating new
> files instead. Adding new files does not affect performance, but if there
> are too many files, the probability of false positives in the Bloom filters
> will increase. So hash partitioning with a Bloom filter index is suitable for
> scenarios where data growth can be estimated within a relatively small range.
>
>
> >>Because the hash partition field values under the parquet file in a
> >columnar storage format are all equal, the added column field hardly
> >occupies storage space after compression.
> >Any new meta field added adds other overhead in terms evolving the schema,
> >so forth. are you suggesting this is not possible to do without a new meta
> >field?
>
> An implementation with no new meta field would be more elegant, but for
> me, not yet familiar with the Hudi source code, it is somewhat difficult
> to implement; it should not be a problem for experts. If we
> implement it without adding new meta fields, I hope I can participate in
> some simple development, and also learn from how the experts do it.
>
>
> >On Thu, Mar 16, 2023 at 2:22 AM 吕虎  wrote:
> >
> >> Hello,
> >>  I feel very honored that you are interested in my views.
> >>
> >>  Here are some of my thoughts marked with blue font.
> >>
> >> At 2023-03-16 13:18:08, "Vinoth Chandar"  wrote:
> >>
> >> >Thanks for the proposal! Some first set of questions here.
> >> >
> >> >>You need to pre-select the number of buckets and use the hash
> function to
> >> >determine which bucket a record belongs to.
> >> >>when building the table according to the estimated amount of data,
> and it
> >> >cannot be changed after building the table
> >> >>When the amount of data in a hash partition is too large, the data in
> >> that
> >> >partition will be split into multiple files in the way of Bloom index.
> >> >
> >> >All these issues are related to bucket sizing could be alleviated by
> the
> >> >consistent hashing index in 0.13? Have you checked it out? Love to hear
> >> >your thoughts on this.
> >>
> >> Hash partitioning is applicable to data tables that cannot give the
> exact
> >> capacity of data, but can estimate a rough range. For example, if a
> company
> >> currently has 300 million customers in the United States, the company
> will
> >> have 7 billion customers in the world at most. In this scenario, using
> hash
> >> partitioning to cope with data growth within the known range by directly
> >> adding files and establishing  bloom filters can still guarantee
> >> performance.
> >> The consistent hash bucket index is also very valuable, but when it is
> >> used for data expansion, it still involves the need to redistribute the
> >> data records of some data files, thus affecting the performance. When
> it is
> >> completely impossible to estimate the range of data capacity, it is very
> >> suitable to use consistent hashing.
> >> >> you can directly search the data under the partition, which greatly
> >> >reduces the scope of the Bloom filter to search for files and reduces
> the
> >> >false positive of the Bloom filter.
> >> >

Re: When using the HoodieDeltaStreamer, is there a corresponding parameter that can control the number of cycles? For example, if I cycle 5 times, I stop accessing data

2023-03-30 Thread Vinoth Chandar
I believe there is no control today. You could hack a precommit validator
and call System.exit if you want ;) (ugly, I know)

But maybe we could introduce some abstraction to do a check between loops?
Or allow users to plug in some logic to decide whether to continue or exit?

Love to understand the use-case more here.
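
Purely illustrative shape for the "check between loops" idea (nothing like
this exists today; all names below are made up):

import java.util.Optional;

public interface PostWriteTerminationStrategy {
  // Called by the continuous-mode loop after each sync round.
  boolean shouldShutdown(long completedLoops, Optional<String> lastCommitTime);
}

// e.g. stop after a fixed number of loops, per the question below:
public class MaxLoopsTerminationStrategy implements PostWriteTerminationStrategy {
  private final long maxLoops;
  public MaxLoopsTerminationStrategy(long maxLoops) { this.maxLoops = maxLoops; }
  @Override
  public boolean shouldShutdown(long completedLoops, Optional<String> lastCommitTime) {
    return completedLoops >= maxLoops;
  }
}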

On Wed, Mar 29, 2023 at 7:32 AM lee  wrote:

> When I use the HoodieDeltaStreamer, the "--continuous" parameter means: "Delta
> Streamer runs in continuous mode running source-fetch -> Transform -> Hudi
> Write in loop". So I would like to ask if there are any corresponding
> parameters that can control the number of cycles, such as stopping
> ingesting data after 5 cycles.
>
>
>
> 李杰
> leedd1...@163.com
>
> 
>


Re: [DISCUSS] Hudi Reverse Streamer

2023-03-30 Thread Vinoth Chandar
Essentially.

Old architecture: (operational database) ==> some tool ==> (data
warehouse raw data) ==> SQL ETL ==> (data warehouse derived data)

New architecture: (operational database) ==> Hudi DeltaStreamer ==> (Hudi
raw data) ==> Spark/Flink Hudi ETL ==> (Hudi derived data) ==> Hudi Reverse
Streamer ==> (Data Warehouse/Kafka/Operational Database)
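
A minimal sketch of the core loop, assuming the existing incremental query
options on the read side (spark, lastCheckpoint, basePath and brokers are
placeholders; the Kafka sink is just one example target):

// Incrementally read everything after the last checkpointed instant ...
Dataset<Row> changed = spark.read().format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", lastCheckpoint)
    .load(basePath);

// ... and push it to the sink, e.g. Kafka via the Spark Kafka batch writer.
changed.selectExpr("CAST(_hoodie_record_key AS STRING) AS key",
        "to_json(struct(*)) AS value")
    .write().format("kafka")
    .option("kafka.bootstrap.servers", brokers)
    .option("topic", "derived_events")
    .save();

The utility would own checkpointing the consumed instant, retries, and the
sink-specific writers (JDBC, DFS, ...).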

On Thu, Mar 30, 2023 at 8:09 PM Vinoth Chandar  wrote:

> Hi all,
>
> Any interest in building a reverse streaming tool, that does the reverse
> of what the DeltaStreamer tool does? It will read Hudi table incrementally
> (only source) and write out the data to a variety of sinks - Kafka, JDBC
> Databases, DFS.
>
> This has come up many times with data warehouse users. Often times, they
> want to use Hudi to speed up or reduce costs on their data ingestion and
> ETL (using Spark/Flink), but want to move the derived data back into a data
> warehouse or an operational database for serving.
>
> What do you all think?
>
> Thanks
> Vinoth
>


[DISCUSS] Hudi Reverse Streamer

2023-03-30 Thread Vinoth Chandar
Hi all,

Any interest in building a reverse streaming tool, that does the reverse of
what the DeltaStreamer tool does? It will read Hudi table incrementally
(only source) and write out the data to a variety of sinks - Kafka, JDBC
Databases, DFS.

This has come up many times with data warehouse users. Often times, they
want to use Hudi to speed up or reduce costs on their data ingestion and
ETL (using Spark/Flink), but want to move the derived data back into a data
warehouse or an operational database for serving.

What do you all think?

Thanks
Vinoth


Re: [BIG CHANGE] Switch logger from log4j2 to slf4j

2023-03-22 Thread Vinoth Chandar
+1 as long as we don't break logging/bundling across all the 7 odd engines
Hudi is integrated into :)
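
For anyone touching code, that concretely means logging via the slf4j API;
this is just standard slf4j usage, shown for reference:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ExampleWriter {
  private static final Logger LOG = LoggerFactory.getLogger(ExampleWriter.class);

  void commit(String instant) {
    // Parameterized logging: no string concatenation when the level is disabled.
    LOG.info("Committing instant {}", instant);
  }
}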

On Wed, Feb 22, 2023 at 12:41 AM Danny Chan  wrote:

> Many popular Apache projects use slf4j now to avoid unnecessary
> conflicts, like Apache Spark, Apache Flink, etc. slf4j is a bridge
> jar/interface over log4j/log4j2 that avoids conflicts; log4j2 is also an
> easily-conflicting jar, even though it has a more stable API than log4j.
>
> As a bridge jar, slf4j relies on an impl jar like log4j2; that means
> both jars are needed for a complete logging system.
>
> For Hudi developers, we suggest always using slf4j for
> subsequent changes.
>
> Related change: https://github.com/apache/hudi/pull/7955
>
> Best,
> Danny
>


Re: Re: DISCUSS

2023-03-21 Thread Vinoth Chandar
>but when it is used for data expansion, it still involves the need to
redistribute the data records of some data files, thus affecting the
performance.
but expansion of the consistent hash index is an optional operation, right?
Sorry, still not fully understanding the differences here.

>Because the hash partition field values under the parquet file in a
columnar storage format are all equal, the added column field hardly
occupies storage space after compression.
Any new meta field added adds other overhead in terms of evolving the schema,
and so forth. Are you suggesting this is not possible to do without a new meta
field?

On Thu, Mar 16, 2023 at 2:22 AM 吕虎  wrote:

> Hello,
>  I feel very honored that you are interested in my views.
>
>  Here are some of my thoughts marked with blue font.
>
> At 2023-03-16 13:18:08, "Vinoth Chandar"  wrote:
>
> >Thanks for the proposal! Some first set of questions here.
> >
> >>You need to pre-select the number of buckets and use the hash function to
> >determine which bucket a record belongs to.
> >>when building the table according to the estimated amount of data, and it
> >cannot be changed after building the table
> >>When the amount of data in a hash partition is too large, the data in
> that
> >partition will be split into multiple files in the way of Bloom index.
> >
> >All these issues are related to bucket sizing could be alleviated by the
> >consistent hashing index in 0.13? Have you checked it out? Love to hear
> >your thoughts on this.
>
> Hash partitioning is applicable to data tables that cannot give the exact
> capacity of data, but can estimate a rough range. For example, if a company
> currently has 300 million customers in the United States, the company will
> have 7 billion customers in the world at most. In this scenario, using hash
> partitioning to cope with data growth within the known range by directly
> adding files and establishing  bloom filters can still guarantee
> performance.
> The consistent hash bucket index is also very valuable, but when it is
> used for data expansion, it still involves the need to redistribute the
> data records of some data files, thus affecting the performance. When it is
> completely impossible to estimate the range of data capacity, it is very
> suitable to use consistent hashing.
> >> you can directly search the data under the partition, which greatly
> >reduces the scope of the Bloom filter to search for files and reduces the
> >false positive of the Bloom filter.
> >the bloom index is already partition aware and unless you use the global
> >version can achieve this. Am I missing something?
> >
> >Another key thing is - if we can avoid adding a new meta field, that would
> >be great. Is it possible to implement this similar to bucket index, based
> >on just table properties?
> Adding a hash partition field to the table to implement hash partitioning
> reuses the existing partitioning machinery well, and
> involves very few code changes. Because the hash partition field values
> within a parquet file in columnar storage format are all equal, the
> added column hardly occupies any storage space after compression.
> Of course, instead of adding a hash partition field to the table, we could
> store the hash partition field in the corresponding metadata to achieve
> this function, but then it would be difficult to reuse existing functionality;
> establishing the hash partition and partition pruning during queries would
> need more time to develop and test.
> >On Sat, Feb 18, 2023 at 8:18 PM 吕虎  wrote:
> >
> >> Hi folks,
> >>
> >> Here is my proposal.Thank you very much for reading it.I am looking
> >> forward to your agreement  to create an RFC for it.
> >>
> >> Background
> >>
> >> In order to deal with the problem that the modification of a small
> amount
> >> of local data needs to rewrite the entire partition data, Hudi divided
> the
> >> partition into multiple File Groups, and each File Group is identified
> by
> >> the File ID. In this way, when a small amount of local data is modified,
> >> only the data of the corresponding File Group needs to be rewritten.
> Hudi
> >> consistently maps the given Hudi record to the File ID through the index
> >> mechanism. The mapping relationship between Record Key and File
> Group/File
> >> ID will not change once the first version of Record is determined.
> >>
> >> At present, Hudi's indexes mainly include Bloom filter index, Hbase
> >> index and bucket index. The Bloom filter index has a false positive
>

Re: About for 0.12.3 Release Timeline

2023-03-21 Thread Vinoth Chandar
Hi,

Given there are some critical regression fixes set to go, I would prefer to
scope down 0.12.3 to just those few PRs and get something out asap. Once
everyone returns, we can drive a 0.12.4 on top? We can then take even till the
end of April.
Others, thoughts?

On Mon, Mar 20, 2023 at 23:39 Forward Xu  wrote:

> Hi folks,
>
> How about April 10th as our release date for 0.12.3? Considering that the
> window from now to April 10th includes the traditional Chinese Qingming
> Festival and a full testing schedule.
>
> ForwardXu
> Best
>


Re: DISCUSS

2023-03-16 Thread Vinoth Chandar
Thanks for the proposal! A first set of questions here.

>You need to pre-select the number of buckets and use the hash function to
determine which bucket a record belongs to.
>when building the table according to the estimated amount of data, and it
cannot be changed after building the table
>When the amount of data in a hash partition is too large, the data in that
partition will be split into multiple files in the way of Bloom index.

All these issues related to bucket sizing could be alleviated by the
consistent hashing index in 0.13? Have you checked it out? Love to hear
your thoughts on this.

> you can directly search the data under the partition, which greatly
reduces the scope of the Bloom filter to search for files and reduces the
false positive of the Bloom filter.
The bloom index is already partition-aware and, unless you use the global
version, can achieve this. Am I missing something?

Another key thing: if we can avoid adding a new meta field, that would
be great. Is it possible to implement this similarly to the bucket index, based
on just table properties?
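
For readers following along, the essence of the proposal quoted below can be
pictured with plain Spark. This is only a sketch of the principle, not the
PR's implementation; the bucket count, input dataset and column names are
illustrative:

import static org.apache.spark.sql.functions.*;

int numHashPartitions = 64;
// Derive a stable hash bucket as an extra partition column ...
Dataset<Row> withBucket = input.withColumn("_hoodie_hash_partition",
    pmod(hash(col("customer_id")), lit(numHashPartitions)));

// ... and make it part of the (complex) partition path.
withBucket.write().format("hudi")
    .option("hoodie.table.name", "customers")
    .option("hoodie.datasource.write.recordkey.field", "customer_id")
    .option("hoodie.datasource.write.partitionpath.field",
        "country,_hoodie_hash_partition")
    .option("hoodie.datasource.write.keygenerator.class",
        "org.apache.hudi.keygen.ComplexKeyGenerator")
    .mode(SaveMode.Append)
    .save("/tmp/hudi/customers");

// A point lookup on customer_id can then be pruned to a single bucket by
// adding the predicate: _hoodie_hash_partition = pmod(hash(<value>), 64)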

On Sat, Feb 18, 2023 at 8:18 PM 吕虎  wrote:

> Hi folks,
>
> Here is my proposal. Thank you very much for reading it. I am looking
> forward to your agreement to create an RFC for it.
>
> Background
>
> In order to deal with the problem that the modification of a small amount
> of local data needs to rewrite the entire partition data, Hudi divided the
> partition into multiple File Groups, and each File Group is identified by
> the File ID. In this way, when a small amount of local data is modified,
> only the data of the corresponding File Group needs to be rewritten. Hudi
> consistently maps the given Hudi record to the File ID through the index
> mechanism. The mapping relationship between Record Key and File Group/File
> ID will not change once the first version of Record is determined.
>
> At present, Hudi's indexes mainly include Bloom filter index, Hbase
> index and bucket index. The Bloom filter index has a false positive
> problem. When a large amount of data results in a large number of File
> Groups, the false positive problem will magnify and lead to poor
> performance. The Hbase index depends on the external Hbase database, and
> may be inconsistent, which will ultimately increase the operation and
> maintenance costs. Bucket index makes each bucket of the bucket index
> correspond to a File Group. You need to pre-select the number of buckets
> and use the hash function to determine which bucket a record belongs to.
> Therefore, you can directly determine the mapping relationship between the
> Record Key and the File Group/File ID through the hash function. Using the
> bucket index, you need to determine the number of buckets in advance when
> building the table according to the estimated amount of data, and it cannot
> be changed after building the table. The unreasonable number of buckets
> will seriously affect performance. Unfortunately, the amount of data is
> often unpredictable and will continue to grow.
>
> Hash partition feasibility
>
>  In this context, I put forward the idea of hash partitioning. The
> principle is similar to bucket index, but in addition to the advantages of
> bucket index, hash partitioning can retain the Bloom index. When the amount
> of data in a hash partition is too large, the data in that partition will
> be split into multiple files in the way of Bloom index. Therefore, the
> problem that bucket index depends heavily on the number of buckets does not
> exist in the hash partition. Compared with the Bloom index, when searching
> for a data, you can directly search the data under the partition, which
> greatly reduces the scope of the Bloom filter to search for files and
> reduces the false positive of the Bloom filter.
>
> Design of a simple hash partition implementation
>
> The idea is to use the capabilities of the ComplexKeyGenerator to
> implement hash partitioning. The hash partition field is one of the partition
> fields of the ComplexKeyGenerator.
>
>    When hash.partition.fields is specified and partition.fields contains
> _hoodie_hash_partition, a column named _hoodie_hash_partition will be added
> to the table as one of the partition keys.
>
> If predicates of hash.partition.fields appear in the query statement, the
> _hoodie_hash_partition = X predicate will be automatically added to the
> query statement for partition pruning.
>
> Advantages of this design: simple implementation, no modification of core
> functions, so low risk.
>
> The above design has been implemented in pr 7984.
>
> https://github.com/apache/hudi/pull/7984
>
> How to use hash partition in spark data source can refer to
>
>
> https://github.com/lvhu-goodluck/hudi/blob/hash_partition_spark_data_source/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestHoodieSparkSqlWriter.scala
>
> #testHashPartition
>
> Perhaps for experts, the implementation of the PR is not elegant enough. I
> also look 

Re: [DISCUSS] Build tool upgrade

2023-02-13 Thread Vinoth Chandar
This is cool! :)

On Mon, Feb 13, 2023 at 2:02 PM Daniel Kaźmirski 
wrote:

> Hi,
>
> I did try to add the mentioned extension to Hudi pom. Here are the results:
>
> Clean with cache extension disabled
> mvn clean package -DskipTests -Dspark3.3 -Dscala-2.12
> -Dmaven.build.cache.enabled=false
> [INFO] BUILD SUCCESS
> [INFO]
> 
> [INFO] Total time: 10:57 min
>
> With cache after changing HoodieSpark33CatalogUtils
> mvn package -DskipTests -Dspark3.3 -Dscala-2.12
> [INFO] BUILD SUCCESS
> [INFO]
> 
> [INFO] Total time: 03:52 min
>
> With cache no changes
> mvn package -DskipTests -Dspark3.3 -Dscala-2.12
> [INFO] BUILD SUCCESS
> [INFO]
> 
> [INFO] Total time: 3.485 s
>
> If anyone would like to try it out:
> https://github.com/apache/hudi/pull/7935
>
> BR,
> Daniel
>
> pt., 10 lut 2023 o 16:59 Daniel Kaźmirski 
> napisał(a):
>
> > Hi all,
> >
> > Going back to this topic, Maven 3.9.0 has been released recently along
> > with a new build cache extension that provides incremental builds:
> > https://maven.apache.org/extensions/maven-build-cache-extension/
> > Might be worth considering.
> >
> > pon., 24 paź 2022 o 19:59 Shiyan Xu 
> > napisał(a):
> >
> >> Thank you all for the valuable inputs! I think we can close this topic
> for
> >> now, given the majority is leaning towards continuing with maven.
> >>
> >> On Mon, Oct 17, 2022 at 8:48 PM zhaojing yu 
> wrote:
> >>
> >> > I have experienced some gradle development projects and want to share
> >> some
> >> > thoughts.
> >> >
> >> > The flexibility and faster speed of gradle itself can certainly bring
> >> some
> >> > advantages, but it will also greatly increase the troubleshooting time
> >> due
> >> > to the bugs of gradle itself, and gradle DSL is very different from
> >> that of
> >> > maven. There are also many learning costs for developers in the
> >> community.
> >> >
> >> > I think it does consume too much time on code release, but users or
> >> > developers usually only compile part of the module.
> >> >
> >> > So I think, a certain advantage in build time alone is not enough to
> >> cover
> >> > so much cost.
> >> >
> >> > Best,
> >> > Zhaojing
> >> >
> >> > Gary Li  于2022年10月17日周一 19:22写道:
> >> >
> >> > > Hi folks,
> >> > >
> >> > > I'd share my thoughts as well. I personally won't build the whole
> >> project
> >> > > too often, only before push to the remote branch or make big changes
> >> in
> >> > > different modules. If I just make some changes and run a test, the
> IDE
> >> > will
> >> > > only build the necessary modules I believe. In addition, each time I
> >> deal
> >> > > with dependency issues, the years of maven experience does help me
> >> locate
> >> > > the issue quickly, especially when the dependency tree is pretty
> >> > > complicated. The learning effort and the new environment setup
> effort
> >> are
> >> > > considerable as well.
> >> > >
> >> > > Happy to learn if there are other benefits gradle or bazel could
> >> bring to
> >> > > us, but if the only benefit is the xx% faster build time, I am a bit
> >> > > unconvinced to make this change.
> >> > >
> >> > > Best,
> >> > > Gary
> >> > >
> >> > > On Mon, Oct 17, 2022 at 2:58 PM Danny Chan 
> >> wrote:
> >> > >
> >> > > > I have first-hand experience with how Apache Calcite switched from
> >> > > > Maven to Gradle, and I want to share some thoughts.
> >> > > >
> >> > > > The gradle build is fast, but it relies heavily on its local cache;
> >> > > > it often needs a lot of time to download these cached jars because
> >> > > > gradle upgrades itself very frequently.
> >> > > >
> >> > > > Gradle is very flexible for building, but it also has many bugs;
> >> > > > you may need more time to debug its bugs compared with building
> >> > > > with maven.
> >> > > >
> >> > > > The gradle DSL for building is a must-learn for all the
> >> > > > developers.
> >> > > >
> >> > > > For all the above reasons, I don't think switching to gradle was the
> >> > > > right decision for Apache Calcite. Julian Hyde, who is the creator of
> >> > > > Calcite, may have more words to say here.
> >> > > >
> >> > > > So I would not suggest we do that for Hudi.
> >> > > >
> >> > > >
> >> > > > Best,
> >> > > > Danny Chan
> >> > > >
> >> > > > Shiyan Xu  于2022年10月1日周六 13:48写道:
> >> > > > >
> >> > > > > Hi all,
> >> > > > >
> >> > > > > I'd like to raise a discussion around the build tool for Hudi.
> >> > > > >
> >> > > > > Maven has been a mature yet slow (10min to build on 2021 macbook
> >> pro)
> >> > > > build
> >> > > > > tool compared to modern ones like gradle or bazel. We all want
> >> faster
> >> > > > > builds, however, we also need to consider the efforts and risks
> to
> >> > > > upgrade,
> >> > > > > and the developers' feedback on usability.
> >> > > > >
> >> > > > > What do you all think 

Re: [ANNOUNCE] Apache Hudi 0.12.1 released

2022-10-19 Thread Vinoth Chandar
Great job everyone!

On Wed, Oct 19, 2022 at 07:11 zhaojing yu  wrote:

> The Apache Hudi team is pleased to announce the release of Apache Hudi
> 0.12.1.
>
> Apache Hudi (pronounced Hoodie) stands for Hadoop Upserts Deletes
> and Incrementals. Apache Hudi manages storage of large analytical
> datasets on DFS (Cloud stores, HDFS or any Hadoop FileSystem compatible
> storage) and provides the ability to query them.
>
> This release comes 2 months after 0.12.0. It includes more than
> 150 resolved issues, comprising of a few new features as well as
> general improvements and bug fixes. You can read the release
> highlights at https://hudi.apache.org/releases/release-0.12.1.
>
> For details on how to use Hudi, please look at the quick start page located
> at https://hudi.apache.org/docs/quick-start-guide.html
>
> If you'd like to download the source release, you can find it here:
> https://github.com/apache/hudi/releases/tag/release-0.12.1
>
> Release notes including the resolved issues can be found here:
>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822=12352182
>
> We welcome your help and feedback. For more information on how to report
> problems, and to get involved, visit the project website at
> https://hudi.apache.org
>
> Thanks to everyone involved!
>
> Release Manager
>


Re: [DISCUSS] Hudi data TTL

2022-10-18 Thread Vinoth Chandar
+1, love to discuss this on an RFC proposal.
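
For context, today this means users hand-roll something like the following on
a schedule; a TTL policy would fold this into the writer/table services (the
config names in the comment are made up, just to seed the RFC):

// e.g. hoodie.ttl.field=ts, hoodie.ttl.retention.days=90 (hypothetical names)
long cutoff = System.currentTimeMillis() - 90L * 24 * 60 * 60 * 1000;
spark.sql("DELETE FROM events WHERE ts < " + cutoff);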

On Tue, Oct 18, 2022 at 13:11 Alexey Kudinkin  wrote:

> That's a very interesting idea.
>
> Do you want to take a stab at writing a full proposal (in the form of an RFC)
> for it?
>
> On Tue, Oct 18, 2022 at 10:20 AM Bingeng Huang 
> wrote:
>
> > Hi all,
> >
> > Do we have a plan to integrate data TTL into Hudi, so we don't have to
> > schedule an offline Spark job to delete outdated data? Just set a TTL
> > config, and the writer or some offline service will delete old data as
> > expected.
> >
>


Re: [DISCUSS] Build tool upgrade

2022-09-30 Thread Vinoth Chandar
Hi Raymond.

This would be a large undertaking and a big change for everyone.

What does the build time look like if we switch to gradle or bazel? And do
we know why it takes 10 min to build, and why that is not okay, given we all
mostly use IDEs anyway?

Thanks
Vinoth

On Fri, Sep 30, 2022 at 22:48 Shiyan Xu  wrote:

> Hi all,
>
> I'd like to raise a discussion around the build tool for Hudi.
>
> Maven has been a mature yet slow (10min to build on 2021 macbook pro) build
> tool compared to modern ones like gradle or bazel. We all want faster
> builds, however, we also need to consider the efforts and risks to upgrade,
> and the developers' feedback on usability.
>
> What do you all think about upgrading to gradle or bazel? Please share your
> thoughts. Thanks.
>
> --
> Best,
> Shiyan
>


Re: 0.12.1 release timeline

2022-09-20 Thread Vinoth Chandar
Thanks for sharing. Do we have an ETA for these?

Zhaojing - please chime in with your thoughts as well.

On Wed, Sep 21, 2022 at 06:34 Y Ethan Guo  wrote:

> Hi Zhaojing,
>
> It would be good if we can land the following bootstrap fixes for 0.12.1
> release.  I'm working on getting them merged.
>
> HUDI-4855: https://github.com/apache/hudi/pull/6694
> HUDI-4453: https://github.com/apache/hudi/pull/6676
>
> Thanks,
> - Ethan
>
> On Tue, Sep 20, 2022 at 12:03 PM Alexey Kudinkin 
> wrote:
>
> > There are also a few critical issues we want to address before cutting
> the
> > 0.12.1 release:
> >
> > HUDI-4760 <https://issues.apache.org/jira/browse/HUDI-4760>
> > HUDI-3636 <https://issues.apache.org/jira/browse/HUDI-3636>
> > HUDI-4885 <https://issues.apache.org/jira/browse/HUDI-4885>
> > HUDI-2780 <https://issues.apache.org/jira/browse/HUDI-2780>
> >
> > On Tue, Sep 20, 2022 at 10:14 AM Sivabalan  wrote:
> >
> > > Sure. We have a few critical PRs that we are looking to land. A few notable
> > > ones are:
> > >
> > > ClassNotFoundException when using hudi-spark-bundle to write table with
> > > hbase index <https://github.com/apache/hudi/pull/6715>
> > > Fix fq can not be queried in pending compaction when query ro table
> with
> > > spark <https://github.com/apache/hudi/pull/6516>
> > > Syncing non-partitioned table has bugs around partition parameters
> > > <https://github.com/apache/hudi/pull/6525>
> > > bootstrap bug fixes: https://github.com/apache/hudi/pull/6694 and
> > > https://github.com/apache/hudi/pull/6676
> > >
> > >
> > > On Mon, 19 Sept 2022 at 20:24, Vinoth Chandar 
> wrote:
> > >
> > > > tbh the RM can make this call. Whether or not 1 week is aggressive,
> > > really
> > > > depends on the scope of release, what's left to land/test.
> > > >
> > > > Would it be useful to frame the discussion in that way?
> > > >
> > > > On Mon, Sep 19, 2022 at 1:25 PM zhaojing yu 
> > wrote:
> > > >
> > > > > Do anyone else have any suggestions?
> > > > > We will determine the time of the code freeze tomorrow.
> > > > >
> > > > > Sivabalan  于2022年9月19日周一 14:05写道:
> > > > >
> > > > > > Hey hi Zhaojing,
> > > > > >   Announcing a code freeze just 1 week ahead might be too
> > > > aggressive.
> > > > > > Do you think, we can make it sometime next week(week of 26th) to
> > give
> > > > > some
> > > > > > buffer for folks to push any critical fixes in. Open to hear what
> > > > others
> > > > > > have to say.
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Fri, 16 Sept 2022 at 01:47, zhaojing yu 
> > > > wrote:
> > > > > >
> > > > > > > To clarify, 09/21 is to cut RC1 and it will be released if all
> > > > > > > testing/checks pass.
> > > > > > >
> > > > > > > zhaojing yu  于2022年9月16日周五 16:45写道:
> > > > > > >
> > > > > > > > Hi folks,
> > > > > > > >
> > > > > > > > As the RM for the 0.12.1 release, I'd like to propose the
> code
> > > > freeze
> > > > > > on
> > > > > > > > Sep 21 (Wed) for any bug fixes that are going to be included
> in
> > > the
> > > > > > minor
> > > > > > > > release, about a month after the 0.12.0 release.  Let me know
> > if
> > > > you
> > > > > > need
> > > > > > > > more time for fixing any issues.
> > > > > > > >
> > > > > > > > Please tag any fix that you think we should include in
> 0.12.1,
> > by
> > > > > > setting
> > > > > > > > the "Fix Version/s" to "0.12.1" in the corresponding Jira
> > ticket.
> > > > As
> > > > > > the
> > > > > > > > RM, I will make the final decision.  I have started
> > > cherry-picking
> > > > > the
> > > > > > > > commits from the master.  I will watch out for ongoing
> critical
> > > > fixes
> > > > > > and
> > > > > > > > remind authors and reviewers in the PRs along the way so they
> > can
> > > > > land
> > > > > > in
> > > > > > > > time.
> > > > > > > >
> > > > > > > > cherry-picking link:
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://docs.google.com/document/d/1G4eeZkSUqMgONRI2YE0o_XBxy9kBaq_Cznolfp2VhS8/edit#heading=h.3fl3egu0kv0z
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > - Zhaojing
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Regards,
> > > > > > -Sivabalan
> > > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > > Regards,
> > > -Sivabalan
> > >
> >
>


Re: 0.12.1 release timeline

2022-09-19 Thread Vinoth Chandar
tbh the RM can make this call. Whether or not 1 week is aggressive really
depends on the scope of the release and what's left to land/test.

Would it be useful to frame the discussion in that way?

On Mon, Sep 19, 2022 at 1:25 PM zhaojing yu  wrote:

> Do anyone else have any suggestions?
> We will determine the time of the code freeze tomorrow.
>
> Sivabalan  于2022年9月19日周一 14:05写道:
>
> > Hey hi Zhaojing,
> >   Announcing a code freeze just 1 week ahead might be too aggressive.
> > Do you think, we can make it sometime next week(week of 26th) to give
> some
> > buffer for folks to push any critical fixes in. Open to hear what others
> > have to say.
> >
> >
> >
> > On Fri, 16 Sept 2022 at 01:47, zhaojing yu  wrote:
> >
> > > To clarify, 09/21 is to cut RC1 and it will be released if all
> > > testing/checks pass.
> > >
> > > zhaojing yu  于2022年9月16日周五 16:45写道:
> > >
> > > > Hi folks,
> > > >
> > > > As the RM for the 0.12.1 release, I'd like to propose the code freeze
> > on
> > > > Sep 21 (Wed) for any bug fixes that are going to be included in the
> > minor
> > > > release, about a month after the 0.12.0 release.  Let me know if you
> > need
> > > > more time for fixing any issues.
> > > >
> > > > Please tag any fix that you think we should include in 0.12.1, by
> > setting
> > > > the "Fix Version/s" to "0.12.1" in the corresponding Jira ticket.  As
> > the
> > > > RM, I will make the final decision.  I have started cherry-picking
> the
> > > > commits from the master.  I will watch out for ongoing critical fixes
> > and
> > > > remind authors and reviewers in the PRs along the way so they can
> land
> > in
> > > > time.
> > > >
> > > > cherry-picking link:
> > > >
> > >
> >
> https://docs.google.com/document/d/1G4eeZkSUqMgONRI2YE0o_XBxy9kBaq_Cznolfp2VhS8/edit#heading=h.3fl3egu0kv0z
> > > >
> > > > Thanks,
> > > > - Zhaojing
> > > >
> > >
> >
> >
> > --
> > Regards,
> > -Sivabalan
> >
>


Re: [ANNOUNCE] Apache Hudi 0.12.0 released

2022-08-18 Thread Vinoth Chandar
Great job, Sagar! Huge congratulations to the entire community in getting
this out!

On Thu, Aug 18, 2022 at 10:45 PM sagar sumit  wrote:

> The Apache Hudi team is pleased to announce the release of Apache Hudi
> 0.12.0.
>
> Apache Hudi (pronounced Hoodie) stands for Hadoop Upserts Deletes
> and Incrementals. Apache Hudi manages storage of large analytical
> datasets on DFS (Cloud stores, HDFS or any Hadoop FileSystem compatible
> storage) and provides the ability to query them.
>
> This release comes 2 months after 0.11.1. It includes more than
> 250 resolved issues, comprising of a few new features as well as
> general improvements and bug fixes. You can read the release
> highlights at https://hudi.apache.org/releases/release-0.12.0.
>
> For details on how to use Hudi, please look at the quick start page located
> at https://hudi.apache.org/docs/quick-start-guide.html
>
> If you'd like to download the source release, you can find it here:
> https://github.com/apache/hudi/releases/tag/release-0.12.0
>
> Release notes including the resolved issues can be found here:
>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822=12351209
>
> We welcome your help and feedback. For more information on how to report
> problems, and to get involved, visit the project website at
> https://hudi.apache.org
>
> Thanks to everyone involved!
>
> Release Manager
>


Re: [VOTE] Release 0.12.0, release candidate #2

2022-08-14 Thread Vinoth Chandar
+1 (binding)

On Sun, Aug 14, 2022 at 14:50 Bhavani Sudha  wrote:

> +1 (binding)
>
>
> [OK] Build successfully all supported spark version
>
> [OK] Ran validation script
>
> [OK] Ran quickstart tests with spark 2.4
>
> [OK] Ran some IDE tests
>
>
> sudha[9:33:26] scripts % ./release/validate_staged_release.sh
> --release=0.12.0 --rc_num=2
>
> /tmp/validation_scratch_dir_001 ~/hudi/scripts
>
> Downloading from svn co https://dist.apache.org/repos/dist/dev/hudi
>
> Validating hudi-0.12.0-rc2 with release type "dev"
>
> Checking Checksum of Source Release
>
> Checksum Check of Source Release - [OK]
>
>
>   % Total% Received % Xferd  Average Speed   TimeTime Time
> Current
>
>  Dload  Upload   Total   SpentLeft
> Speed
>
> 100 62287  100 622870 0  39174  0  0:00:01  0:00:01 --:--:--
> 39149
>
> Checking Signature
>
> Signature Check - [OK]
>
>
> Checking for binary files in source release
>
> No Binary Files in Source Release? - [OK]
>
>
> Checking for DISCLAIMER
>
> DISCLAIMER file exists ? [OK]
>
>
> Checking for LICENSE and NOTICE
>
> License file exists ? [OK]
>
> Notice file exists ? [OK]
>
>
> Performing custom Licensing Check
>
> Licensing Check Passed [OK]
>
>
> Running RAT Check
>
> RAT Check Passed [OK]
>
>
> Thanks,
>
> Sudha
>
> On Sun, Aug 14, 2022 at 11:16 AM Y Ethan Guo  wrote:
>
> > +1 (non-binding)
> >
> > - [OK] checksums and signatures
> > - [OK] ran release validation script
> > - [OK] built successfully (Spark 2.4, 3.2, 3.3)
> > - [OK] ran Spark quickstart with Spark 3.3.0
> > - [OK] ran a few tests on schema evolution
> > - [OK] Presto connector performance
> >
> > Best,
> > - Ethan
> >
> > On Thu, Aug 11, 2022 at 5:22 AM sagar sumit  wrote:
> >
> > > Hi everyone,
> > >
> > > Please review and vote on the release candidate #2 for the version
> > 0.12.0,
> > > as follows:
> > >
> > > [ ] +1, Approve the release
> > > [ ] -1, Do not approve the release (please provide specific comments)
> > >
> > > The complete staging area is available for your review, which includes:
> > >
> > > * JIRA release notes [1],
> > > * the official Apache source release and binary convenience releases to
> > be
> > > deployed to dist.apache.org [2], which are signed with the key with
> > > fingerprint FD215342E3199419ADFBF41DD4623E3AA16D75B0 [3],
> > > * all artifacts to be deployed to the Maven Central Repository [4],
> > > * source code tag "release-0.12.0-rc2" [5],
> > >
> > > The vote will be open for at least 72 hours. It is adopted by majority
> > > approval, with at least 3 PMC affirmative votes.
> > >
> > > Thanks,
> > > Release Manager
> > >
> > > [1]
> > >
> > >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822=12351209
> > > [2] https://dist.apache.org/repos/dist/dev/hudi/hudi-0.12.0-rc2/
> > > [3] https://dist.apache.org/repos/dist/dev/hudi/KEYS
> > > [4]
> > https://repository.apache.org/content/repositories/orgapachehudi-1090/
> > > [5] https://github.com/apache/hudi/releases/tag/release-0.12.0-rc2
> > >
> >
>


Re: 0.12.0 Release Timeline

2022-08-10 Thread Vinoth Chandar
Hello.

Any updates on RC2? :)

On Sat, Aug 6, 2022 at 10:36 AM sagar sumit  wrote:

> Hi folks,
>
> Thanks for voting on RC1.
> I will be preparing RC2 by Monday, 8th August end of day PST,
> and I will send out a separate voting email for RC2.
>
> Regards,
> Sagar
>
> On Fri, Jul 29, 2022 at 6:08 PM sagar sumit 
> wrote:
>
> > We can now resume merging to master branch.
> > Thanks for your patience.
> >
> > Regards,
> > Sagar
> >
>


Re: [DISCUSS]: Integrate column stats index with all query engines

2022-08-10 Thread Vinoth Chandar
+1 for this.

Suggested new reviewers on the RFC.
https://github.com/apache/hudi/pull/6345/files#r943073339
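
For reviewers less familiar with the Spark side, this is roughly the shape of
what already exists there (0.12.x config names, worth double-checking), which
the RFC would bring to Hive/Presto/Trino; df, spark and basePath are
placeholders:

// Writer side: build the column stats index in the metadata table.
df.write().format("hudi")
    .option("hoodie.table.name", "demo")
    .option("hoodie.datasource.write.recordkey.field", "key")
    .option("hoodie.datasource.write.precombine.field", "ts")
    .option("hoodie.metadata.enable", "true")
    .option("hoodie.metadata.index.column.stats.enable", "true")
    .mode(SaveMode.Append)
    .save(basePath);

// Reader side: opt into file pruning ("data skipping") with it.
Dataset<Row> pruned = spark.read().format("hudi")
    .option("hoodie.metadata.enable", "true")
    .option("hoodie.enable.data.skipping", "true")
    .load(basePath)
    .filter("price > 100"); // only files whose stats can match are scanned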

On Wed, Aug 10, 2022 at 9:56 PM Pratyaksh Sharma 
wrote:

> Hello community,
>
> With the introduction of the multi-modal index in Hudi, there is a lot of scope
> for improvement on the querying side. There are 2 major ways of reducing
> the data scanned at query time - partition pruning and file pruning.
> With the latest developments in the community, partition pruning is
> supported for commonly used query engines like Spark, Presto and Hive, but file
> pruning using the column stats index is only supported for Spark and Flink.
>
> We intend to support data skipping for the rest of the engines as well
> which include hive, presto and trino. I have written a draft RFC here -
> https://github.com/apache/hudi/pull/6345.
>
> Please take a look and let me know what you think. Once we have some
> feedback from the community, we can decide on the next steps.
>


Re: Joining Slack workspace

2022-07-31 Thread Vinoth Chandar
Hi Siva,

has the site link been fixed?

Thanks
Vinoth

On Fri, Jul 29, 2022 at 11:06 AM Sivabalan  wrote:

> thanks for confirming.
>
> On Fri, 29 Jul 2022 at 09:35, Ken Krugler 
> wrote:
>
> > That worked, thanks!
> >
> > — Ken
> >
> > > On Jul 28, 2022, at 8:11 PM, Sivabalan  wrote:
> > >
> > > Can you try this
> > > <
> >
> https://join.slack.com/t/apache-hudi/shared_invite/zt-1d5zjsfl3-d_TefVaGyvEe16EANrxz6Q
> > >
> > > please.
> > > If this works for you, I will fix the website w/ the right link.
> > >
> > >
> > >
> > >
> > > On Fri, 29 Jul 2022 at 02:57, Ken Krugler  >
> > > wrote:
> > >
> > >> Hi all,
> > >>
> > >> I’m wondering if adding a comment to
> > >> https://github.com/apache/hudi/issues/143 <
> > >> https://github.com/apache/hudi/issues/143> is still the best way to
> get
> > >> an invite.
> > >>
> > >> The old invite link <
> > >>
> >
> https://join.slack.com/t/apache-hudi/shared_invite/zt-1c44wsfgl-gX_6f50DezdnWVrGsxC~Ug
> > >
> > >> no longer works, and I don’t see activity on the Github issue (above),
> > so
> > >> wondering if it’s still being monitored.
> > >>
> > >> Thanks!
> > >>
> > >> — Ken
> > >>
> > >> --
> > >> Ken Krugler
> > >> http://www.scaleunlimited.com
> > >> Custom big data solutions
> > >> Flink, Pinot, Solr, Elasticsearch
> > >>
> > >>
> > >>
> > >>
> > >
> > > --
> > > Regards,
> > > -Sivabalan
> >
> > --
> > Ken Krugler
> > http://www.scaleunlimited.com
> > Custom big data solutions
> > Flink, Pinot, Solr, Elasticsearch
> >
> >
> >
> >
>
> --
> Regards,
> -Sivabalan
>


Re: 0.12.0 Release Timeline

2022-07-14 Thread Vinoth Chandar
+1 from me.

On Thu, Jul 14, 2022 at 9:43 AM sagar sumit  wrote:

> Hi Folks,
>
> After some deliberation with the community and keeping the release blockers
> <
> https://github.com/apache/hudi/pulls?q=is%3Apr+is%3Aopen+label%3Apriority%3Ablocker
> >
> in
> mind,
> I am proposing a new date for code freeze *July 25, 11:59 PM PST*.
>
> Please highlight any concerns you may have.
>
> Regards,
> Sagar
>
> On Wed, Jul 13, 2022 at 5:59 PM Shawy Geng 
> wrote:
>
> > July 30 may be better for me. There are some performance issues with Spark
> > row writing that need to be fixed, and that needs more detailed benchmark
> > testing.
> >
> > sagar sumit  于2022年7月11日周一 12:59写道:
> >
> > > Hi Folks,
> > >
> > > Some excellent features from the community are in review and
> near-landing
> > > (could take up to a week).
> > > The release blockers are tracked here
> > > <
> > >
> >
> https://github.com/apache/hudi/pulls?q=is%3Apr+is%3Aopen+label%3Apriority%3Ablocker
> > > >
> > > .
> > > So, I'd like to propose an updated release timeline.
> > > - *July 20, 11:59 PM PST*: Code freeze - new features/functionalities
> > > won't be merged to master.
> > > - *July 22, 11:59 PM PST*: Cut release branch and start RC
> > voting/testing.
> > >
> > > Regards,
> > > Sagar
> > >
> > > On Wed, Jun 29, 2022 at 2:54 PM sagar sumit  wrote:
> > >
> > > > Hi Folks,
> > > >
> > > > I am the RM for the upcoming release 0.12.0.
> > > > In line with our roadmap, I'd like to propose the following timeline:
> > > >
> > > > - *July 15, 11:59 PM PST*: Code freeze - new features/functionalities
> > > > won't be merged to master.
> > > > - *July 18, 11:59 PM PST*: Cut release branch and start RC
> > > voting/testing.
> > > >
> > > > Please highlight any concerns with the timeline.
> > > > If it works, please +1 on this thread.
> > > >
> > > > Also, if you have not done already, please tag any JIRAs that you
> have
> > > > planned for the release by setting its "Fix Version/s" to "0.12.0"
> > > >
> > > > Regards,
> > > > Sagar
> > > >
> > >
> >
>


Re: native interface support

2022-07-12 Thread Vinoth Chandar
Hi all,

Overall +1. Love to take this idea forward. At the very least, a good C++
API for HoodieTimeline and HoodieTableFileSystemView should be enough to
get CoW working end-to-end.
One issue is the lack of a standard HFile reader in C++; given a lot of
our metadata layer uses the HFile format, we may have to invest in/investigate
this more.

Thanks
Vinoth

On Tue, Jul 5, 2022 at 9:10 PM Forward Xu  wrote:

> hi dujunling!
> Glad to hear about a native interface for native SQL engines. Let's start with
> this email thread.
>
> Best,
> ForwardXu
>
> junling du  于2022年7月4日周一 17:01写道:
>
> > Hi folks,
> >   I am an engineer from ByteDance. I am working on integrating Hudi with
> > native SQL engines such as Apache Doris. Currently Hudi does not have a
> > native interface such as C++/Rust, which is very inconvenient when
> > integrating with a native SQL engine.
> > Does the community have a plan to provide a native interface?
> >
> > dujunling
> > 2022-7-4
> >
>


Re: [DISCUSS] Diagnostic reporter

2022-06-15 Thread Vinoth Chandar
+1 from me.

It will be very useful if we can have something that can gather
troubleshooting info easily.
This part takes a while currently.
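
Very rough sketch of the shape this could take (all names hypothetical, just
to seed the RFC discussion):

import com.fasterxml.jackson.databind.ObjectMapper;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

public class DiagnosticReporter {
  // Snapshot effective configs plus environment info into one JSON file
  // that a user can attach to a GitHub issue.
  public void report(Map<String, String> effectiveHudiConfigs,
                     String engineVersion, Path outFile) throws Exception {
    Map<String, Object> diag = new LinkedHashMap<>();
    diag.put("timestamp", System.currentTimeMillis());
    diag.put("javaVersion", System.getProperty("java.version"));
    diag.put("engineVersion", engineVersion);
    diag.put("hudiConfigs", effectiveHudiConfigs); // must redact credentials!
    Files.write(outFile,
        new ObjectMapper().writerWithDefaultPrettyPrinter().writeValueAsBytes(diag));
  }
}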

On Mon, May 30, 2022 at 9:52 AM Shiyan Xu 
wrote:

> Hi all,
>
> When troubleshooting Hudi jobs in users' environments, we always ask users
> to share configs, environment info, check spark UI, etc. Here is an RFC
> idea: can we extend the Hudi metrics system and make a diagnostic reporter?
> It can be turned on like a normal metrics reporter. It should collect
> common troubleshooting info and save it to JSON or another human-readable text
> format. Users should be able to run with it and share the diagnosis file.
> The RFC should discuss what info should / can be collected.
>
> Does this make sense? Anyone interested in driving the RFC design and
> implementation work?
>
> --
> Best,
> Shiyan
>


Re: [DISCUSS] Hudi sync meetings for Chinese community

2022-05-26 Thread Vinoth Chandar
Great! Thanks for volunteering

On Thu, May 26, 2022 at 02:09 Shiyan Xu  wrote:

> Awesome! looking forward to an initial proposal!
>
> On Thu, May 26, 2022 at 4:17 PM Shimin Yang  wrote:
>
> > Hi Shiyan, I'm from the ByteDance data lake team, and our team would like to
> > drive and host the Hudi sync meetings for the Chinese community.
> >
> > Shiyan Xu  于2022年5月26日周四 16:14写道:
> >
> > > Related info: we are noting down the current community sync info here
> > > https://hudi.apache.org/community/syncs
> > >
> > >
> > > On Thu, May 26, 2022 at 3:44 PM Shiyan Xu  >
> > > wrote:
> > >
> > > > Hi all,
> > > >
> > > > This is a topic brought up previously, and also recently raised in
> this
> > > > issue : we are thinking
> of
> > > > hosting another regular sync meeting for the Chinese community,
> having
> > a
> > > > more suitable time and converse in Chinese. This requires efforts on
> > > > coordinating time, agenda, speakers, and hosting platform. Hence I
> > would
> > > > like to call out for volunteers to start driving this. Thank you.
> > > >
> > > > --
> > > > Best,
> > > > Shiyan
> > > >
> > >
> > >
> > > --
> > > Best,
> > > Shiyan
> > >
> >
>
>
> --
> Best,
> Shiyan
>


Re: 0.11.1 release timeline

2022-05-24 Thread Vinoth Chandar
+1 as well.

On Mon, May 23, 2022 at 23:59 Shiyan Xu  wrote:

> +1 on the timeline. So to clarify, basically 06/01 is to cut RC1 and it
> will be released if all testing/checks pass, right?
>
> On Tue, May 24, 2022 at 12:59 PM Y Ethan Guo  wrote:
>
> > Hi folks,
> >
> > As the RM for the 0.11.1 release, I'd like to propose the code freeze on
> > Jun 1st (Wed) for any bug fixes that are going to be included in the
> minor
> > release, about a month after the 0.11.0 release.  Let me know if you need
> > more time for fixing any issues.
> >
> > Please tag any fix that you think we should include in 0.11.1, by setting
> > the "Fix Version/s" to "0.11.1" in the corresponding Jira ticket.  As the
> > RM, I will make the final decision.  I have started cherry-picking the
> > commits from the master.  I will watch out for ongoing critical fixes and
> > remind authors and reviewers in the PRs along the way so they can land in
> > time.
> >
> > Thanks,
> > - Ethan
> >
>
>
> --
> Best,
> Shiyan
>


Re: [VOTE] Monthly Community Sync Time

2022-05-17 Thread Vinoth Chandar
+1 for changing.

9 AM is my preference.

On Tue, May 17, 2022 at 1:20 PM Bhavani Sudha 
wrote:

> Hi everyone,
>
> The community sync happens on the last Wednesday of every month. Currently it is
> scheduled at 7 am, which is way too early for a lot of folks. The following are
> the proposed times for the meeting. Please find the
> corresponding discuss thread here
>  for
> more
> context.
>
> - 8:00 AM pacific time
> - 8:30 AM pacific time
> - 9:00 AM pacific time
>
> Please indicate your preferred time. This thread will be open for 72 hours,
> and the result will be adopted by majority approval after that.
>
> Thanks,
> Sudha
>


Re: [ANNOUNCE] Apache Hudi 0.11.0 released

2022-05-02 Thread Vinoth Chandar
+1 this was a very well coordinated release. Took tons of dedication. Thank
you Raymond!

On Mon, May 2, 2022 at 20:24 Forward Xu  wrote:

> Thank you Raymond for your hard work and dedication.
>
> forwardxu
> best
>
> Shiyan Xu  于2022年5月3日周二 09:51写道:
>
> > The Apache Hudi team is pleased to announce the release of Apache Hudi
> > 0.11.0. This has been a huge community effort with 638 commits from 61
> > contributors across the globe.
> >
> >
> > Apache Hudi (pronounced Hoodie) stands for Hadoop Upserts Deletes
> >
> > and Incrementals. Apache Hudi manages storage of large analytical
> >
> > datasets on DFS (Cloud stores, HDFS or any Hadoop FileSystem compatible
> > storage)
> >
> > and provides the ability to query them.
> >
> >
> > This release comes 3 months after 0.10.1. You can read more about the
> > release
> > notes here:
> >
> > https://hudi.apache.org/releases/release-0.11.0
> >
> > For details on how to use Hudi, please look at the quick start page
> located
> > at
> >
> > https://hudi.apache.org/docs/quick-start-guide.html
> >
> > If you'd like to download the source release, you can find it here:
> >
> > https://github.com/apache/hudi/releases/tag/release-0.11.0
> > 
> >
> > We welcome your help and feedback. For more information on how to
> >
> > report problems, and to get involved, visit the project website at:
> >
> >
> > http://hudi.apache.org/
> >
> > Thanks to everyone involved!
> >
> > Release Manager
> >
>


Re: Spark structured streaming and Spark SQL improvements

2022-04-27 Thread Vinoth Chandar
Thanks for the thoughtful note, Daniel!

All of 1-3 look good to me. Yann/Raymond or other Spark usuals here, any
thoughts on adding these for 0.12?

In 0.12 we want to get schema evolution to GA. That's also a very useful
suggestion. Tao (author of schema evolution), any thoughts?
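
To make 1-3 concrete, the user-facing surface could look roughly like this;
every option name below is purely illustrative, nothing here is implemented:

Dataset<Row> stream = spark.readStream().format("hudi")
    // (1) explicit starting instant instead of earliest/checkpoint
    .option("hoodie.datasource.streaming.start.instanttime", "20220401000000")
    // (2) fall back to a full table scan if the instant is archived
    .option("hoodie.datasource.streaming.fallback.fulltablescan.enable", "true")
    // (3) rate limiting per micro-batch
    .option("hoodie.datasource.streaming.max.instants.per.trigger", "100")
    .load(basePath);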

On Mon, Apr 25, 2022 at 4:39 PM Daniel Kaźmirski 
wrote:

> Hi,
>
> I would like to propose a few additions to Spark structured streaming in
> Hudi and spark sql improvements. These would make my life easier as a Hudi
> user, so this is from user perspective, not sure how about the
> implementation side :)
>
> Spark Structured Streaming:
> 1. As a user, I would like to be able to specify starting instant position
> in for reading Hudi table streaming query, this is not possible in
> structured streaming right now, it starts streaming data from the earliest
> available instant or from instant saved in checkpoint.
>
> 2. In Hudi 0.11 it's possible to fall back to a full table scan in the
> absence of commits afaik; this is used in the delta streamer. I would like
> to have the same functionality in a structured streaming query.
>
> 3. I would like to be able to limit the input rate when reading a stream
> from a Hudi table. I'm thinking about adding maxInstantsPerTrigger/
> maxBytesPerTrigger. E.g., I would like to have 100 instants per trigger in
> my micro batch.
>
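> For illustration, (1) and (3) could look roughly like this from the user
> side (a sketch; the option names here are hypothetical, only to convey the
> idea):
>
>   val stream = spark.readStream.format("hudi")
>     // hypothetical knob: start the stream from a given instant
>     .option("hoodie.datasource.streaming.start.instanttime", "20220401000000")
>     // hypothetical knob: cap instants consumed per micro batch
>     .option("hoodie.datasource.streaming.maxInstantsPerTrigger", "100")
>     .load(basePath) // spark session and basePath assumed
>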
> Spark SQL:
> Since 0.11 we get very flexible schema evolution. Therefore, can we as users
> automatically evolve the schema on MERGE INTO operations?
> I guess this should only be supported when we use update set * and insert *
> in merge operation.
> In case of missing columns, reconcile schema functionality can be used.
>
>
> Best Regards,
> Daniel Kaźmirski
>


Re: [DISSCUSS][NEW FEATURE] Hudi Lake Manager

2022-04-27 Thread Vinoth Chandar
I left my thoughts on the RFC https://github.com/apache/hudi/pull/4309

I just see this as another deployment model, where a centralized set of
microservices takes up scheduling and execution of Hudi's table services.

+1 on thinking about sharding, locking and HA upfront.

Thanks
Vinoth

On Thu, Apr 21, 2022 at 3:31 PM Alexey Kudinkin  wrote:

> Hey, folks!
>
> I feel there's quite a bit of confusion in this thread, so let's try to
> clear it: my understanding (please correct me if I'm wrong) is that
> Lake Manager was referred to as a service in a similar sense to
> how we call compaction, clustering and cleaning *table services*.
>
> So, I'd suggest we be extra careful in using familiar terms to avoid
> stirring up confusion: for all things related to *RPC services*
> (like Metastore Server) we can call them "servers", and for compaction,
> clustering and the rest we stick w/ "table services".
>
> If my understanding of the proposal is correct, then I think the proposal
> is to consolidate knobs and levers for Data Governance, Data Management,
> etc
> w/in the layer called *Lake Manager*, which will be orchestrating already
> existing table services through a nicely abstracted high-level API.
>
> Regarding adding any new *server* components: given Hudi's *stateless*
> architecture where we rely on standalone execution engines (like Spark or
> Flink) to operate, i don't really see us introducing a server component
> directly into Hudi's core. Metastore Server on the other hand will be a
> *standalone* component, that Hudi (as well as other processes) could be
> relying on to access the metadata.
>
> On Mon, Apr 18, 2022 at 10:07 PM Yue Zhang 
> wrote:
>
> > Thanks for all your attention.
> > Sure, we do need to take care of high availability in design.
> >
> > Also in my opinion this lake manager wouldn't drive hudi into a database
> > on the cloud. It is just an official option, something like
> > HoodieDeltaStreamer, that helps users reduce maintenance and hudi data
> > governance efforts.
> >
> > As for resource and performance concerns, this lake manager should be
> > designed as a planner/master, for example, lake manager will call out
> > cleaner apis to launch a (spark/flink) execution to delete files under
> > certain conditions based on table metadata information, rather than doing
> > works itself. So that the workload and resources requirement is much
> less.
> > But in general, I agree that we have to consider failure recovery and
> high
> > availability, etc.
> >
> > On 2022/04/19 04:30:22 Simon Su wrote:
> > > >
> > > > I agree with what Danny said. IMO, there are two points that should
> > > > be considered:
> > >
> > > 1. If Lake Manager is designed as a service, we should consider its
> > > High Availability, Dynamic Expanding/Shrinking, and state consistency.
> > > 2. How many resources will Lake Manager use to execute those HUDI
> > > actions such as compaction, clustering, etc.?
> > >
> >
>


Re: [DISCUSS] hudi index improve

2022-04-27 Thread Vinoth Chandar
Hi all,

This is a great discussion and nice to see how all of this is
coming together.

Penning down my thoughts.

A) +1 on exposing INDEX syntax, we can start with Spark/Flink where we have
full control on connectors and iterate faster.

B) Do we need a manual refresh mode? Almost all databases always keep indexes
in sync with data; I think it's an easier model to begin with. Thoughts?
See https://github.com/apache/hudi/blob/master/rfc/rfc-45/rfc-45.md, which
already adds the ability to re-build an index asynchronously.
This should answer some of Danny's questions as well.

C) Can we study how databases allow specifying different types of indexes
and mimic that syntax? e.g.
https://www.postgresql.org/docs/current/indexes-types.html
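
For example, a mock-up of what mimicking that syntax could look like in spark
sql (keywords and index type names here are hypothetical, not an agreed
design):

  spark.sql("CREATE INDEX idx_city ON hudi_table USING bloom_filter (city)")
  spark.sql("CREATE INDEX idx_ts ON hudi_table USING column_stats (ts)")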

D) Indexing is a table service as well, and it can be pulled into the table
management service/lake manager (or a cooler name we can give it :)). There
is a lot of functionality we should be able to reuse here already for
building the indexing service.

Love to help streamline this effort; it's very valuable. Overall +1

Thanks
Vinoth

On Mon, Apr 18, 2022 at 7:54 PM Danny Chan  wrote:

> In general, it seems that the INDEX commands mainly serve batch
> scenarios; there are some cases that need clarifying here:
>
> 1. When a user creates an index with manual refresh first and then
> inserts a batch of data (named d1) into the table, does the index
> created take effect on d1?
> 2. If a user executes a DROP INDEX command on the table and there is
> another streaming job writing to the table using and building the
> index, what happens then?
> 3. For multi-engine index support, do you mean to execute the CREATE
> INDEX syntax on all kinds of engines? Does that mean we should
> support building indexes for all these engines? And if the writer is a
> different engine that also writes/reads the index, how do we handle the
> transactions?
> 4. We may distinguish between different kinds of indexes in the
> syntax, because the current indexes of Hudi (column stats index, bloom
> filter index, and pk index) are all a little different from database pk
> and secondary indexes; should we give them specific KEYWORDs?
>
> Best,
> Danny
>
> Y Ethan Guo  于2022年4月19日周二 01:49写道:
> >
> > +1 it would be great to make Hudi's index support all query engines.
> Given
> > that we already have multi-modal index (column stats index, bloom filter
> > index) in metadata table and there is a proposal to have a metastore
> > server, is the ultimate goal to serve the index from metastore leveraging
> > metadata table for all engines?
> >
> > On Mon, Apr 18, 2022 at 7:39 AM wangxianghu  wrote:
> >
> > > +1 on index improvement.
> > > Index optimization is a very valuable thing for hudi.
> > > Looking forward to the design doc.
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > At 2022-04-18 11:18:35, "Forward Xu"  wrote:
> > > >Hi All,
> > > >
> > > >I want to improve hudi's index. There are four main steps to achieve
> > > >this:
> > > >
> > > >1. Implement index syntax
> > > >a. Implement index syntax for spark sql [1]; I have submitted the
> > > >first pr.
> > > >b. Implement index syntax for prestodb sql
> > > >c. Implement index syntax for trino sql
> > > >
> > > >2. read/write index decoupling
> > > >The read/write index is decoupled from the computing engine side, and
> > > >the sql index syntax of the first step can be independently executed
> > > >and called through the API.
> > > >
> > > >3. build index service
> > > >
> > > >Promote the implementation of the hudi service framework, including
> > > >index service, metastore service[2], compact/cluster service[3], etc.
> > > >
> > > >4. Index Management
> > > >There are two kinds of management semantics for indexes.
> > > >
> > > >   - Automatic Refresh
> > > >   - Manual Refresh
> > > >
> > > >
> > > >   1. Automatic Refresh
> > > >
> > > >When a user creates an index on the main table without using the WITH
> > > >DEFERRED REFRESH syntax, the index will be managed by the system
> > > >automatically. For every data load to the main table, the system will
> > > >immediately trigger a load to the index automatically. These two data
> > > >loads (to main table and index) are executed in a transactional manner,
> > > >meaning that either both succeed or neither does.
> > > >
> > > >The data loading to index is incremental, avoiding an expensive total
> > > >refresh.
> > > >
> > > >If a user performs the following commands on the main table, the
> > > >system will return failure (reject the operation):
> > > >
> > > >
> > > >   - Data management commands: UPDATE/DELETE.
> > > >   - Schema management commands: ALTER TABLE DROP COLUMN, ALTER TABLE
> > > >   CHANGE DATATYPE, ALTER TABLE RENAME. Note that adding a new column
> > > >   is supported, and for the drop-column and change-datatype commands,
> > > >   hudi will check whether it will impact the index table; if not, the
> > > >   operation will be allowed.

Re: [VOTE] Release 0.11.0, release candidate #3

2022-04-26 Thread Vinoth Chandar
+1 (binding)

Ran RC checks. Passed

On Sun, Apr 24, 2022 at 6:18 AM Shiyan Xu 
wrote:

> Hi everyone,
>
> Please review and vote on the release candidate #3 for the version 0.11.0,
> as follows:
>
> [ ] +1, Approve the release
>
> [ ] -1, Do not approve the release (please provide specific comments)
>
>
>
> The complete staging area is available for your review, which includes:
>
> * JIRA release notes [1],
>
> * the official Apache source release and binary convenience releases to be
> deployed to dist.apache.org [2], which are signed with the key with
> fingerprint E1FACC15B67B2C5149224052D3B314F3B6E9C746 [3],
>
> * all artifacts to be deployed to the Maven Central Repository [4],
>
> * source code tag "0.11.0-rc3" [5],
>
>
>
> The vote will be open for at least 72 hours. It is adopted by majority
> approval, with at least 3 PMC affirmative votes.
>
>
>
> Thanks,
>
> Release Manager
>
>
>
> [1]
>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822=12350673
>
> [2] https://dist.apache.org/repos/dist/dev/hudi/hudi-0.11.0-rc3/
>
> [3] https://dist.apache.org/repos/dist/release/hudi/KEYS
>
> [4] https://repository.apache.org/content/repositories/orgapachehudi-1078/
>
> [5] https://github.com/apache/hudi/releases/tag/release-0.11.0-rc3
>


Re: [DISCUSS] Hudi community sync time

2022-04-26 Thread Vinoth Chandar
+1 as well. Current PST times are pretty hard for many folks.

On Sat, Apr 16, 2022 at 6:20 AM Gary Li  wrote:

> +1 for splitting into two sessions. The current schedule is challenging for
> both US and Chinese folks. We can organize another session for the Chinese
> timezone.
>
> Calling out for folks living in the Chinese timezone, please reply to this
> email thread if you are interested to join a sync meeting. We can schedule
> one if we have enough interest.
>
> Best,
> Gary
>
> On Sat, Apr 16, 2022 at 2:36 AM Bhavani Sudha 
> wrote:
>
> > Hello all,
> >
> > Our current monthly community syncs happen around 7 am pacific time on
> > the last Wednesday of each month. It is already 10 pm in China then, and
> > we don't get to see Chinese folks in the community sync call. We have users from
> > different time zones and finding an overlap is challenging as it is. In
> > this context I am proposing the following:
> >
> > - We can split the community syncs into two - one catered towards Chinese
> > time and the other one that happens currently for the rest of the folks.
> > - If we split it into two different syncs then, we can move the 7 am
> > pacific time to 8 am or 9am as well.
> >
> > Please share your thoughts on this proposal.
> >
> > Thanks,
> > Sudha
> >
>


Re: spark 3.2.1 built-in bloom filters

2022-04-04 Thread Vinoth Chandar
By all means. That would be great.

Always looking for a helping hand in improving docs.

On Sat, Apr 2, 2022 at 6:18 AM Nicolas Paris 
wrote:

> Hi Vinoth,
>
> Thanks for your in-depth explanations. I think those details could be
> of interest in the documentation. I can work on this if agreed.
>
> On Wed, 2022-03-30 at 14:36 -0700, Vinoth Chandar wrote:
> > Hi,
> >
> > I noticed that it finally landed. We actually began tracking that
> > JIRA
> > while initially writing Hudi at Uber. Parquet + Bloom Filters has
> > taken
> > just a few years :)
> > I think we could switch out to reading the built-in bloom filters as
> > well.
> > It could make the footer reading lighter potentially.
> >
> > Few things that Hudi has built on top would be missing
> >
> > - Dynamic bloom filter support, where we auto size current bloom
> > filters
> > based on number of records, given a fpp target
> > - Our current DAG that optimizes for checking records against bloom
> > filters
> > is still needed on writer side. Checking bloom filters for a given
> > predicate e.g id=19, is much simpler compared to matching say a 100k
> > ids
> > against 1000 files. We need to be able to amortize the cost of these
> > 100M
> > comparisons.
> >
> > On the future direction, with 0.11, we are enabling storing of bloom
> > filters and column ranges inside the Hudi metadata table (MDT).
> > (what we
> > call multi-modal indexes).
> > This helps us make the access more resilient towards cloud storage
> > throttling and also more performant (we need to read much fewer
> > files)
> >
> > Over time, when this mechanism is stable, we plan to stop writing out
> > bloom
> > filters in parquet and also integrate the Hudi MDT with different
> > query
> > engines for point-ish lookups.
> >
> > Hope that helps
> >
> > Thanks
> > Vinoth
> >
> >
> >
> >
> > On Mon, Mar 28, 2022 at 9:57 AM Nicolas Paris
> > 
> > wrote:
> >
> > > Hi,
> > >
> > > spark 3.2 ships parquet 1.12 which provides built-in bloom filters
> > > on
> > > arbitrary columns. I wonder if:
> > >
> > > - hudi can benefit from them ? (likely in 0.11, but not with MOR
> > > tables)
> > > - would make sense to replace the hudi blooms with them ?
> > > - what would be the advantage of storing our blooms in hfiles
> > > (AFAIK
> > >   this is the future expected implementation) over the parquet
> > > built-in.
> > >
> > >
> > > here is the syntax:
> > >
> > > .option("parquet.bloom.filter.enabled#favorite_color", "true")
> > > .option("parquet.bloom.filter.expected.ndv#favorite_color",
> > > "100")
> > >
> > >
> > > and here some code to illustrate :
> > >
> > >
> > >
> https://github.com/apache/spark/blob/e40fce919ab77f5faeb0bbd34dc86c56c04adbaa/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala#L1654
> > >
> > >
> > >
> > > thx
> > >
>


Re: [ANNOUNCE] New Apache Hudi Committer - Zhaojing Yu

2022-03-31 Thread Vinoth Chandar
Congrats!

On Thu, Mar 31, 2022 at 4:06 AM leesf  wrote:

> Congrats!
>
> Vino Yang wrote on Thu, Mar 31, 2022 at 17:03:
>
> > Congrats!
> >
> > Best,
> > Vino
> >
> > Gary Li wrote on Fri, Mar 25, 2022 at 19:11:
> > >
> > > Congrats!
> > >
> > > Best,
> > > Gary
> > >
> > > On Fri, Mar 25, 2022 at 4:07 PM Shiyan Xu  >
> > > wrote:
> > >
> > > > Congrats!
> > > >
> > > > On Fri, Mar 25, 2022 at 1:40 PM Danny Chan 
> > wrote:
> > > >
> > > > > Hi everyone,
> > > > >
> > > > > On behalf of the PMC, I'm very happy to announce Zhaojing Yu as a
> new
> > > > > Hudi committer.
> > > > >
> > > > > Zhaojing is very active in Flink Hudi contributions; many cool
> > > > > features such as the flink streaming bootstrap, compaction service,
> > > > > and all kinds of writing modes were contributed by him. He also
> > > > > fixed many critical bugs on the Flink side.
> > > > >
> > > > > Besides that, Zhaojing is also active in publicizing Hudi use cases
> > > > > in China; he is very active in answering user questions in our
> > > > > Dingtalk group. Now he is working at Bytedance, pushing forward the
> > > > > Volcanic cloud service Hudi products!
> > > > >
> > > > > Please join me in congratulating Zhaojing for becoming a Hudi
> > committer!
> > > > >
> > > > > Cheers,
> > > > > Danny
> > > > >
> > > >
> > > >
> > > > --
> > > > --
> > > > Best,
> > > > Shiyan
> > > >
> >
>


Re: spark 3.2.1 built-in bloom filters

2022-03-30 Thread Vinoth Chandar
Hi,

I noticed that it finally landed. We actually began tracking that JIRA
while initially writing Hudi at Uber. Parquet + Bloom Filters has taken
just a few years :)
I think we could switch out to reading the built-in bloom filters as well.
It could make the footer reading lighter potentially.

A few things that Hudi has built on top would be missing:

- Dynamic bloom filter support, where we auto size current bloom filters
based on number of records, given a fpp target
- Our current DAG that optimizes for checking records against bloom filters
is still needed on the writer side. Checking bloom filters for a given
predicate, e.g. id=19, is much simpler compared to matching say 100k ids
against 1000 files. We need to be able to amortize the cost of these 100M
comparisons.
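
For intuition on the dynamic sizing, it follows the standard bloom filter
math; a minimal sketch (illustrative only, not Hudi's actual implementation):

  // bits m and hash count k for n entries at a target false-positive rate fpp
  def bloomSize(n: Long, fpp: Double): (Long, Int) = {
    val m = math.ceil(-n * math.log(fpp) / math.pow(math.log(2), 2)).toLong
    val k = math.max(1, math.round(m.toDouble / n * math.log(2)).toInt)
    (m, k)
  }
  // e.g. bloomSize(100000L, 1e-9) sizes a filter for 100k keys at 1e-9 fpp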

On the future direction, with 0.11, we are enabling storing of bloom
filters and column ranges inside the Hudi metadata table (MDT) (what we
call multi-modal indexes).
This helps us make the access more resilient towards cloud storage
throttling and also more performant (we need to read much fewer files)

Over time, when this mechanism is stable, we plan to stop writing out bloom
filters in parquet and also integrate the Hudi MDT with different query
engines for point-ish lookups.

Hope that helps

Thanks
Vinoth




On Mon, Mar 28, 2022 at 9:57 AM Nicolas Paris 
wrote:

> Hi,
>
> spark 3.2 ships parquet 1.12 which provides built-in bloom filters on
> arbitrary columns. I wonder if:
>
> - hudi can benefit from them ? (likely in 0.11, but not with MOR tables)
> - would make sense to replace the hudi blooms with them ?
> - what would be the advantage of storing our blooms in hfiles (AFAIK
>   this is the future expected implementation) over the parquet built-in.
>
>
> here is the syntax:
>
> .option("parquet.bloom.filter.enabled#favorite_color", "true")
> .option("parquet.bloom.filter.expected.ndv#favorite_color", "100")
>
>
> and here some code to illustrate :
>
>
> https://github.com/apache/spark/blob/e40fce919ab77f5faeb0bbd34dc86c56c04adbaa/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala#L1654
>
>
>
> thx
>


Re: [DISCUSS] New RFC to support Lock-free concurrency control on Merge-on-read tables

2022-03-24 Thread Vinoth Chandar
+1. Love to be a co-author on the RFC, if you are open to it.
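
For context, the OCC multi-writer mode referenced in the proposal below is
typically enabled with settings along these lines (a sketch; treat the exact
keys as per the Hudi concurrency control docs):

  df.write.format("hudi")
    // run writers under optimistic concurrency control with a lock provider
    .option("hoodie.write.concurrency.mode", "optimistic_concurrency_control")
    .option("hoodie.cleaner.policy.failed.writes", "LAZY")
    .option("hoodie.write.lock.provider",
      "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider")
    .mode("append").save(basePath) // df and basePath assumed from the pipeline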

On Mon, Mar 21, 2022 at 12:31 PM 冯健  wrote:

> Hi team,
>
> The situation is that Optimistic concurrency control (OCC) has some
> limitations:
>
>-
>
>When conflicts do occur, they may waste massive resources during every
>attempt (lakehouse-concurrency-control-are-we-too-optimistic
><
> https://hudi.apache.org/blog/2021/12/16/lakehouse-concurrency-control-are-we-too-optimistic
> >
>).
>-
>
>multiple writers may cause data duplicates when records with the same new
>record-key arrive. See multi-writer-guarantees
><https://hudi.apache.org/docs/concurrency_control#multi-writer-guarantees>
>
> Some background: with OCC, we assume multiple writers won't write data to
> the same FileID most of the time; if there is a FileID-level conflict, the
> commit will be rolled back. And a FileID-level conflict check can't
> guarantee no duplicates if two records with the same new record-key arrive
> at multiple writers, since the key-to-bucket mapping is not consistent
> with the bloom index.
>
> What I plan to do is support Lock-free concurrency control with a
> non-duplicates guarantee in hudi (only for Merge-On-Read tables).
>
>-
>
>With a canIndexLogfiles index, multiple writers ingesting data into
>Merge-on-read tables can only append data to delta logs. This is a
>lock-free process if we can make sure they don’t write data to the same
>log file (the plan is to create multiple marker files to achieve this). And
>with the log merge API (preCombine logic in Payload class), data in log
>files can be read properly.
>-
>
>Since hudi already has an index type like Bucket index, which can map
>key to bucket in a consistent way, data duplicates can be eliminated.
>
>
> Thanks,
> Jian Feng
>


Re: 0.11.0 release timeline

2022-03-22 Thread Vinoth Chandar
+1 from me, as long as we don’t push it out more.

On Tue, Mar 22, 2022 at 12:29 Raymond Xu 
wrote:

> Ok Vinoth, thanks for highlighting this. BigQuery integration is an
> important feature to add to 0.11.0. I also see some other inflight work
> from the backlog. To accommodate this and other inflight issues / testing
> activities, I suggest making a one-time adjustment: pushing out 7 days from
> the original timeline, i.e. Mar 31 for feature freeze and Apr 3 for cutting
> the release branch.
>
> Sounds good to the community here?
>
> --
> Best,
> Raymond
>
>
> On Tue, Mar 22, 2022 at 12:50 PM Vinoth Govindarajan <
> vinoth.govindara...@gmail.com> wrote:
>
> > Hi Raymond,
> > I'm working on the Hudi <> BigQuery Integration RFC-34. I'm trying to
> > wrap up everything and send out the PR before the end of this week. I was
> > wondering, would it be possible to include this PR as part of the 0.11.0
> > release? Let me know what I need to do to make this part of the next
> > release, thanks!
> >
> > Best,
> > Vinoth
> >
> >
> > On Fri, Mar 18, 2022 at 10:58 PM Raymond Xu  >
> > wrote:
> >
> > > Hi all,
> > >
> > > As we're approaching late March when we intended to release 0.11.0, I'd
> > > like to call out this timeline
> > >
> > > - Mar 24 00:00 PST: feature freeze - new features/functionalities
> > > won't be merged to master (6 days from now)
> > > - Mar 27 00:00 PST: cut release branch and start RC voting/testing
> > > (9 days from now)
> > >
> > > Please kindly highlight any concerns. Thank you.
> > >
> > > Best,
> > > Raymond
> > >
> >
>


Re: [DISCUSS] New RFC to create LogCompaction action for MOR tables?

2022-03-21 Thread Vinoth Chandar
+1 overall

On Sat, Mar 19, 2022 at 5:02 PM Surya Prasanna 
wrote:

> Hi Sagar,
> Sorry for the delay in response. Thanks for the questions.
>
> 1. Trying to understand the main goal. Is it to balance the tradeoff
> between read and write amplification for metadata table? Or is it purely to
> optimize for reads?
> > On large tables, write amplification is a side effect of frequent
> compactions. So, instead of increasing the frequency of full compaction, we
> are proposing minor compaction (LogCompaction) to be done frequently to
> merge only the log blocks and write a new log block. By merging the blocks,
> there are fewer blocks to deal with during reads; that way we are
> optimizing for read performance and potentially avoiding the write
> amplification problem.
>
> 2. Why do we need a separate action? Why can't any of the existing
> compaction strategies (or a new one if needed) help to achieve this?
> > A new compaction strategy can be added, but we thought it might
> complicate the existing logic and need to rely on some hacks, especially
> since Compaction action writes to a base file and places a .commit file
> upon completion. Whereas in our use case we are not concerned with the
> base file at all; instead we are merging blocks and writing back to the log
> file. So, we thought it is better to use a new action (called
> LogCompaction), which works at a log file level and writes back to the log
> file. Since log files are in general added by deltacommit, upon completion
> LogCompaction can place a .deltacommit.
>
> 3. Is the proposed LogCompaction a replacement for regular compaction for
> metadata table i.e. if LogCompaction is enabled then compaction cannot be
> done?
> > LogCompaction is not a replacement for regular compaction. LogCompaction
> is performed as a minor compaction so as to reduce the no. of log blocks to
> consider. It does not consider base files while merging the log blocks. To
> merge log files with base file Compaction action is still needed. By using
> LogCompaction action frequently, the frequency with which we do full scale
> compaction is reduced.
> Consider a scenario in which, after 'X' LogCompaction actions, for
> some file groups the log file size becomes comparable to the base file
> size; in this scenario a LogCompaction action is going to take close to the
> same amount of time as a Compaction action. So, instead of LogCompaction, a
> full-scale Compaction needs to be performed on those file groups. In future
> we can also introduce logic to determine what is the right
> action (Compaction or LogCompaction) to be performed depending on the state
> of the file group.
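>
> As a concrete illustration of that policy (purely a sketch with simplified
> stand-in types, not the proposed implementation):
>
>   case class FileGroupSizes(baseFileBytes: Long, totalLogBytes: Long)
>   // pick full compaction once logs grow comparable to the base file
>   def chooseAction(fg: FileGroupSizes): String =
>     if (fg.totalLogBytes >= 0.8 * fg.baseFileBytes) "compaction"
>     else "logcompaction"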
>
> Thanks,
> Surya
>
>
> On Fri, Mar 18, 2022 at 11:22 PM Surya Prasanna Yalla 
> wrote:
>
> >
> >
> > -- Forwarded message -
> > From: sagar sumit 
> > Date: Wed, Mar 16, 2022 at 11:17 PM
> > Subject: Re: [DISCUSS] New RFC to create LogCompaction action for MOR
> > tables?
> > To: 
> >
> >
> > Hi Surya,
> >
> > This is a very interesting idea! I'll be looking forward to RFC.
> >
> > I have a few high-level questions:
> >
> > 1. Trying to understand the main goal. Is it to balance the tradeoff
> > between read and write amplification for metadata table? Or is it purely
> to
> > optimize for reads?
> > 2. Why do we need a separate action? Why can't any of the existing
> > compaction strategies (or a new one if needed) help to achieve this?
> > 3. Is the proposed LogCompaction a replacement for regular compaction for
> > metadata table i.e. if LogCompaction is enabled then compaction cannot be
> > done?
> >
> > Regards,
> > Sagar
> >
> > On Thu, Mar 17, 2022 at 12:51 AM Surya Prasanna <
> > prasannakumar...@gmail.com>
> > wrote:
> >
> > > Hi Team,
> > >
> > >
> > > Record level index uses a metadata table which is a MOR table type.
> > >
> > > Each delta commit in the metadata table creates multiple hfile log
> > > blocks, and so to read them multiple file handles have to be opened,
> > > which might cause issues in read performance. To address the read
> > > performance issue, compaction can be run frequently, which basically
> > > merges all the log blocks to the base file and creates another version
> > > of the base file. If this is done frequently, it would cause write
> > > amplification.
> > >
> > > Instead of merging all the log blocks to the base file and doing a full
> > > compaction, a minor compaction can be done which basically merges log
> > > blocks and creates one new log block.
> > >
> > > This can be achieved by adding a new action to Hudi called
> > > LogCompaction, and it requires an RFC. Please let me know what you think.
> > >
> > >
> > > Thanks,
> > >
> > > Surya
> > >
> >
>


Re: Unbundling "spark-avro" dependency

2022-03-08 Thread Vinoth Chandar
Thanks Alexey.

This has actually been the case for a while now, I think. From what I can see,
our quickstart for spark still suggests passing spark-avro in via
--packages, but utilities bundle related examples are relying on the fact
that this is pre-bundled.

I do acknowledge that with recent Spark 3.x versions, breakages have become
much more frequent, amplifying this pain. However, to prevent jobs from
failing upon upgrade (i.e. forcing everyone to redeploy streaming + batch
jobs with the --packages flag), I would prefer if we actually kept the same
bundling behavior with the following simplifications.

1. We have three spark profiles now - spark2, spark3.1.x, and spark3
(3.2.1). We continue to bundle spark-avro and support the latest spark
minor version
2. We retain and make the docs clearer about how users can "optionally"
unbundle and deploy for other versions.
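
For example, with an unbundled build, users would launch with something like
`--packages org.apache.spark:spark-avro_2.12:3.2.1` (matching their exact
Spark version) instead of relying on the bundled copy.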

Given other large features going out, turned on by default this release,
not sure if it's a good idea to introduce a breaking change like this.

Thanks
Vinoth

On Tue, Mar 8, 2022 at 1:32 PM Alexey Kudinkin  wrote:

> Hello, everyone!
>
> While working on HUDI-3549 <
> https://issues.apache.org/jira/browse/HUDI-3549>,
> we've surprisingly discovered that Hudi actually bundles "spark-avro"
> dependency *by default*.
>
> This is problematic b/c "spark-avro" is tightly coupled with some of the
> other Spark components making up its core distribution (i.e. being packaged
> in Spark itself, not an external package; one example of that is
> "spark-sql").
>
> In regards to HUDI-3549
>  itself,
> the problem there unfolded like the following:
>
>1. We've built "hudi-spark-bundle" which got "spark-avro" 3.2.1 bundled
>along with it
>2. @Sivabalan tried to use this Hudi bundle w/ Spark 3.2.0
>3. It failed b/c "spark-avro" 3.2.1 is *not compatible* w/ "spark-sql"
>3.2.0 (b/c of https://github.com/apache/spark/pull/34978, fixing typo
>and renaming Internal API methods DataSourceUtils)
>
>
> To avoid these problems going forward, our proposal is to:
>
>1. *Unbundle* "spark-avro" from Hudi bundles by default (practically
>this means that Hudi users would need to now specify spark-avro via
>`--packages` flag, since it's not part of Spark's core distribution)
>2. (Optional) If the community still sees value in bundling (and shading)
>"spark-avro" in some cases, we can add Maven profile that would allow
> to do
>that *ad hoc*.
>
> We've put a PR#4955  with the
> proposed changes.
>
> Looking forward to your feedback.
>


Re: Next stop : Minor Or Major release?

2022-02-17 Thread Vinoth Chandar
+1 on B as well. Same rationale as Raymond's. I think we have all major
chunks landed or PRs up.
Love to provide integration testing before the release.

On Thu, Feb 17, 2022 at 4:25 PM Raymond Xu 
wrote:

> I'm +1 to B. There are really awesome features planned for 0.11.0. Hoping
> to see these more thoroughly tested in the major release.
>
>
> --
> Best,
> Raymond
>
>
> On Wed, Feb 16, 2022 at 5:13 AM Sivabalan  wrote:
>
> > Hi folks,
> > As the Hudi community has been very active and is used by many across
> > the globe, we would like to have a continuous train of releases: every
> > 2 to 3 months a major release and, immediately following the major
> > release, a minor bug fix release (which we agreed upon as a community).
> > If we look at the roadmap laid out here, we may not be able to meet the
> > deadline if we plan for a major release by Feb end. Even if not all, we
> > are looking to complete a sizable set of features. We might need at
> > least 2 weeks for proper integration testing.
> >
> > Given that we did a minor bug fix release of 0.10.1 on Jan 26th, we
> > have two options with us.
> >
> > Option A: Do another minor bug fix release by end of Feb. And do 0.11 by
> > end of March.
> > Option B: We can go for 0.11 by end of March w/o needing another bug fix
> > release in between since we just had a release 2 weeks back.
> >
> > Do remember that, if we plan to go w/ 0.10.2, it might be yet another bug
> > fix release, and so we may not get any features as such that went in
> > after 0.10.0.
> >
> > Wanted to hear your thoughts and opinions.
> >
> >
> > --
> > Regards,
> > -Sivabalan
> >
>


Re: [DISCUSS] Change data feed for spark sql

2022-02-14 Thread Vinoth Chandar
Hi all,

I would love to not introduce new constructs like "timestamp", "snapshots.
Hudi already has a clear notion of commit times, that can unlock this.
Can we just use this as an opportunity to standardize the incremental
query's schema?
In fact, don't we already have a change feed with our incremental query? We
just need to emit delete records and the old images of records.
Those are the only gaps I see.
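
For reference, the incremental query today looks roughly like this (begin/end
instants below are just example values):

  val changes = spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20220201000000")
    .option("hoodie.datasource.read.end.instanttime", "20220214000000")
    .load(basePath) // basePath: path to the Hudi table

Emitting deletes and old record images on top of this is what a change feed
would add.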

+1 for an RFC. I would be happy to jam on the design!

Thanks
Vinoth

On Mon, Feb 14, 2022 at 6:25 AM Sivabalan  wrote:

> +1 for the feature. I see a lot of benefits like clustering, index
> building etc.
>
> On Sun, 13 Feb 2022 at 22:21, leesf  wrote:
> >
> > +1 for the feature.
> >
> > vino yang wrote on Sat, Feb 12, 2022 at 22:14:
> >
> > > +1 for this feature, looking forward to share more details or design
> doc.
> > >
> > > Best,
> > > Vino
> > >
> > > Xianghu Wang wrote on Sat, Feb 12, 2022 at 17:06:
> > >
> > > > this is definitely a great feature
> > > >  +1
> > > >
> > > > On 2022/02/12 02:32:32 Forward Xu wrote:
> > > > > Hi All,
> > > > >
> > > > > I want to support change data feed for spark sql. This feature can
> > > > > be
> > > > > achieved in two ways.
> > > > >
> > > > > 1. Call Procedure Command
> > > > > sql syntax
> > > > > CALL system.table_changes('tableName', start_timestamp, end_timestamp)
> > > > > example:
> > > > > CALL system.table_changes('tableName', TIMESTAMP '2021-01-23 04:30:45',
> > > > > TIMESTAMP '2021-02-23 6:00:00')
> > > > >
> > > > > 2. Support querying MOR(CDC) table as of a savepoint
> > > > > SELECT * FROM A.B TIMESTAMP AS OF 1643119574;
> > > > > SELECT * FROM A.B TIMESTAMP AS OF '2019-01-29 00:37:58' ;
> > > > >
> > > > > SELECT * FROM A.B TIMESTAMP AS OF '2019-01-29 00:37:58' AND
> > > > > '2021-02-23 6:00:00';
> > > > > SELECT * FROM A.B VERSION AS OF 'Snapshot123456789';
> > > > >
> > > > > Any feedback is welcome!
> > > > >
> > > > > Thank you.
> > > > >
> > > > > Regards,
> > > > > Forward Xu
> > > > >
> > > > > Related Links:
> > > > > [1] Call Procedure Command <
> > > > https://issues.apache.org/jira/browse/HUDI-3161>
> > > > > [2] Support querying a table as of a savepoint
> > > > > 
> > > > > [3] Change data feed
> > > > > <
> > > >
> > >
> https://docs.databricks.com/delta/delta-change-data-feed.html#language-sql
> > > > >
> > > > >
> > > >
> > >
>
>
>
> --
> Regards,
> -Sivabalan
>


Re: [DISCUSS] Dropping Spark 3.0.x support in 0.11

2022-01-23 Thread Vinoth Chandar
+1 for this. The rate of API breakages across minor Spark versions is a bit
untenable anyway.

On Wed, Jan 12, 2022 at 1:22 AM Raymond Xu 
wrote:

> Hi Chen,
>
> yes this is actually been worked on by liujinhui
> https://issues.apache.org/jira/browse/HUDI-2370
>
> and it's planned for 0.11
>
> --
> Best,
> Raymond
>
>
> On Sun, Jan 9, 2022 at 4:22 PM 陈 翔  wrote:
>
> > There is another thing I care about: in spark 3.2.0, parquet 1.12
> > introduces the encryption feature.
> >
> >
> https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#columnar-encryption
> > Do we have plans to introduce this feature into hudi? I think this
> > feature is a great help for the security of the data lake.
> >
> >
> > From: Raymond Xu 
> > Date: Monday, January 10, 2022, 5:48 AM
> > To: dev@hudi.apache.org 
> > Subject: [DISCUSS] Dropping Spark 3.0.x support in 0.11
> > Hi all,
> >
> > The incompatible changes from different Spark 3 versions, namely spark
> > 3.0.x, spark 3.1.x, and the latest spark 3.2, are making the spark
> support
> > difficult. I'm proposing to drop Spark 3.0.x support from the next major
> > release 0.11.0. Spark 3.0.x has been supported since 0.7.0 through
> 0.10.0.
> > The Spark 3 support matrix will look like
> >
> >- 0.11.0
> >   - 3.2.0 (default build), 3.1.x
> >- 0.10.x
> >   - 3.1.x (default build), 3.0.x
> >- 0.7.0 - 0.9.0
> >   - 3.0.x
> >- 0.6.0 and prior
> >   - not supported
> >
> > Thanks.
> >
> > Best,
> > Raymond
> >
>


Re: [VOTE] Release 0.10.1, release candidate #2

2022-01-22 Thread Vinoth Chandar
+1 (binding)

Ran my RC checks on the updated link; changing my vote to a +1.

On Sat, Jan 22, 2022 at 4:10 AM Sivabalan  wrote:

> my bad, the link ([2]) was wrong. It is
> https://dist.apache.org/repos/dist/dev/hudi/hudi-0.10.1-rc2/.
> Can you take a look please?
>
> On Sat, 22 Jan 2022 at 00:08, Vinoth Chandar  wrote:
>
> > -1
> >
> > The artifact version is wrong! It should be 0.10.*1*
> >
> >
> >- hudi-0.10.0-rc2.src.tgz
> ><
> >
> https://dist.apache.org/repos/dist/dev/hudi/hudi-0.10.0-rc2/hudi-0.10.0-rc2.src.tgz
> > >
> >- hudi-0.10.0-rc2.src.tgz.asc
> ><
> >
> https://dist.apache.org/repos/dist/dev/hudi/hudi-0.10.0-rc2/hudi-0.10.0-rc2.src.tgz.asc
> > >
> >- hudi-0.10.0-rc2.src.tgz.sha512
> ><
> >
> https://dist.apache.org/repos/dist/dev/hudi/hudi-0.10.0-rc2/hudi-0.10.0-rc2.src.tgz.sha512
> > >
> >
> > grep version hudi-0.10.0-rc2/pom.xml | grep rc2
> >   0.10.0-rc2
> >
> >
> > Why are all the artifacts versioned 0.10.0?
> >
> > On Thu, Jan 20, 2022 at 3:53 AM Sivabalan  wrote:
> >
> > > Hi everyone,
> > >
> > > Please review and vote on the release candidate #2 for the version
> > 0.10.1,
> > > as follows:
> > >
> > > [ ] +1, Approve the release
> > >
> > > [ ] -1, Do not approve the release (please provide specific comments)
> > >
> > >
> > > The complete staging area is available for your review, which includes:
> > >
> > > * JIRA release notes [1],
> > >
> > > * the official Apache source release and binary convenience releases to
> > be
> > > deployed to dist.apache.org [2], which are signed with the key with
> > > fingerprint ACD52A06633DB3B2C7D0EA5642CA2D3ED5895122 [3],
> > >
> > > * all artifacts to be deployed to the Maven Central Repository [4],
> > >
> > > * source code tag "release-0.10.1-rc2" [5],
> > >
> > >
> > > The vote will be open for at least 72 hours. It is adopted by majority
> > > approval, with at least 3 PMC affirmative votes.
> > >
> > >
> > > Thanks,
> > > Release Manager
> > >
> > >
> > > [1]
> > >
> > >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822=12351135
> > >
> > > [2] https://dist.apache.org/repos/dist/dev/hudi/hudi-0.10.0-rc2
> > >
> > > [3] https://dist.apache.org/repos/dist/dev/hudi/KEYS
> > >
> > > [4]
> > >
> > >
> >
> https://repository.apache.org/content/repositories/orgapachehudi-1052/org/apache/hudi/
> > >
> > > [5] https://github.com/apache/hudi/tree/release-0.10.1-rc2
> > >
> > > --
> > > Regards,
> > > -Sivabalan
> > >
> >
>
>
> --
> Regards,
> -Sivabalan
>


Re: [VOTE] Release 0.10.1, release candidate #2

2022-01-21 Thread Vinoth Chandar
-1

The artifact version is wrong! It should be 0.10.*1*


   - hudi-0.10.0-rc2.src.tgz
   

   - hudi-0.10.0-rc2.src.tgz.asc
   

   - hudi-0.10.0-rc2.src.tgz.sha512
   


grep version hudi-0.10.0-rc2/pom.xml | grep rc2
  0.10.0-rc2


Why are all the artifacts versioned 0.10.0?

On Thu, Jan 20, 2022 at 3:53 AM Sivabalan  wrote:

> Hi everyone,
>
> Please review and vote on the release candidate #2 for the version 0.10.1,
> as follows:
>
> [ ] +1, Approve the release
>
> [ ] -1, Do not approve the release (please provide specific comments)
>
>
> The complete staging area is available for your review, which includes:
>
> * JIRA release notes [1],
>
> * the official Apache source release and binary convenience releases to be
> deployed to dist.apache.org [2], which are signed with the key with
> fingerprint ACD52A06633DB3B2C7D0EA5642CA2D3ED5895122 [3],
>
> * all artifacts to be deployed to the Maven Central Repository [4],
>
> * source code tag "release-0.10.1-rc2" [5],
>
>
> The vote will be open for at least 72 hours. It is adopted by majority
> approval, with at least 3 PMC affirmative votes.
>
>
> Thanks,
> Release Manager
>
>
> [1]
>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822=12351135
>
> [2] https://dist.apache.org/repos/dist/dev/hudi/hudi-0.10.0-rc2
>
> [3] https://dist.apache.org/repos/dist/dev/hudi/KEYS
>
> [4]
>
> https://repository.apache.org/content/repositories/orgapachehudi-1052/org/apache/hudi/
>
> [5] https://github.com/apache/hudi/tree/release-0.10.1-rc2
>
> --
> Regards,
> -Sivabalan
>


Re: [DISCUSS] New RFC? Add Call Procedure Command for spark sql

2022-01-10 Thread Vinoth Chandar
+1, please start an RFC

On Fri, Jan 7, 2022 at 5:50 AM Forward Xu  wrote:

> Hi All,
>
> I want to add a Call Procedure Command to spark sql, which will be very
> useful for DDL and DML functionality that cannot otherwise be handled. I
> can think of the following 4 aspects:
> - Commit management
> - Metadata table management
> - Table migration
> - Table optimization
>
> The main function has been implemented, and two Procedure
> implementations, *show_commits*
> and *rollback_to_instant*, have been added. Here is the link:
> https://github.com/apache/hudi/pull/4535
>
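> For a feel of the usage, invocation could look roughly like this (a sketch
> only; the exact procedure signatures are what the review/RFC should settle):
>
>   spark.sql("CALL show_commits(table => 'hudi_table', limit => 10)")
>   spark.sql("CALL rollback_to_instant(table => 'hudi_table', " +
>     "instant_time => '20220101000000')")
>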
> Is this a good idea to start an RFC?
>
> Thank you.
>
> Regards,
> Forward Xu
>


Re: 0.10.1 Release timeline

2021-12-28 Thread Vinoth Chandar
When we say code freeze, does it mean that all commits that need to be part
of 0.10.1 must be in master by Jan 7?

Thanks
Vinoth

On Tue, Dec 28, 2021 at 11:02 AM Sivabalan  wrote:

> Hi folks,
> As agreed upon in another thread, wanted to propose a timeline for
> the 0.10.1 minor release (a bug fix release). Since this is a bug fix
> release, I am thinking of end of next week, Jan 7, as the code freeze. Let
> me know what you think, or if someone needs more time to push any bug fixes.
>
> Also, would like to know if anyone wants to volunteer to be RM(Release
> manager). I can guide you through the process if you have not been a RM
> before. If no one volunteers, I can chime in to be RM.
>
> I will follow up in other thread wrt timelines for minor releases going
> forward.
>
> --
> Regards,
> -Sivabalan
>


Re: Regular minor/patch releases

2021-12-28 Thread Vinoth Chandar
Hi,

I would love for us to get into the cadence of a regular bug fix/minor
release, on top of the latest major version.
What do you think about a minor release every month?
Once there is a new major release, we will switch to issuing the next minor
release on top of it, i.e. once 0.11.0 is out, the next minor will be 0.11.1
and not 0.10.x.

On Tue, Dec 28, 2021 at 11:08 AM Sivabalan  wrote:

> Following up on having regular minor bug fix releases, this is what I am
> thinking:
> a major release every 2 to 2.5 months, and a minor bug fix release in the
> month following the major release. If the major release gets pushed, we
> can skip having a 2nd minor release for now (due to resource
> availability).
> If we have consensus, we can try to have a minor release 1 month after any
> major release. For e.g., we had a major release by Dec 10, and so the 2nd
> week of Jan is when we can target 0.10.1; we might need at least a month
> from a major release to have accrued some bug fixes.
>
> Open to hear thoughts from the community.
>
>
>
> On Wed, Dec 15, 2021 at 2:15 PM Vinoth Chandar  wrote:
>
> > Hi all,
> >
> > Thanks for chiming in with the feedback. Looks like there is broad
> support
> > for this.
> >
> > Responding to few of the views below.
> >
> > >With the rush in features without enough tests, I'm afraid the major
> > release version is never ready for production,
> > While I agree with you, don't want to be very idealistic here either.
> 0.10
> > for e.g had a lot of testing on RCs and bug fixes after as well. And some
> > of the features were hardened at places at Uber before we released, but
> > open source major releases are generally rough (you can even see how
> rough
> > newer Spark versions are for e.g), and the community puts in the effort
> to
> > make it more and more stable going forward. Hudi's problem IMO has been
> > that we have done only major releases from 0.6 to 0.10 (given our
> resource
> > crunch during the pandemic times). Now, is a good time to revisit this.
> >
> > >when fixing bugs against the master branch, the contributors/committers
> > should also open a new PR
> > We can try this and encourage this always. I am just worried that this
> adds
> > more burden on contributors and things may get missed. IMO we can pick
> two
> > RMs at any time. One for the next major release and one for the next
> minor
> > release and have them shepherd the bug fixes through? We mark JIRAs with
> > two fix versions.
> >
> > >And for minor releases, there should only include the bug fixes, no
> > breaking change, no feature, it should not be a hard work i think.
> > +100 on this. otherwise it defeats the purpose of the minor release.
> >
> > Thanks
> > Vinoth
> >
> > On Wed, Dec 15, 2021 at 7:22 AM leesf  wrote:
> >
> > > +1
> > >
> > > We could create new branches such as release-0.10 as the master branch
> > for
> > > 0.10.0, 0.10.1 .etc version release, and when fixing bugs against the
> > > master branch, the contributors/committers should also open a new PR
> > > against the release-0.10 branch if needed. That would avoid
> > cherry-picking
> > > all bug fixes from master to release-0.10 at one time and cause so many
> > > conflicts. You would see the Spark[1] and Flink[2] community also
> > > maintaining a multi-master branch as well.
> > >
> > > [1] https://github.com/apache/spark/tree/branch-3.1
> > > https://github.com/apache/spark/tree/branch-3.2
> > > [2] https://github.com/apache/flink/tree/release-1.12
> > > https://github.com/apache/flink/tree/release-1.13
> > >
> > > vino yang wrote on Wed, Dec 15, 2021 at 18:12:
> > >
> > > > +1
> > > >
> > > > Agree that minor release mostly for bug fix purpose.
> > > >
> > > > Best,
> > > > Vino
> > > >
> > > > Danny Chan wrote on Wed, Dec 15, 2021 at 10:35:
> > > >
> > > > > I guess we must do that for current rapid development and
> iteration.
> > As
> > > > for
> > > > > the release 0.10.0, after the announcement of only a few days we
> have
> > > > > received a bunch of bugs reported by the github issues: such as
> > > > >
> > > > > - the empty meta file: https://github.com/apache/hudi/issues/4249
> > > > > - and the timeline based marker files:
> > > > > https://github.com/apache/hudi/issues/4230
> > > > >
> > > > > With the rush in features

Re: Preparation for 0.10.1 minor release

2021-12-20 Thread Vinoth Chandar
Hi Siva,

Can we use "fix version(s)" to track this? We can tag multiple fix versions
with each JIRA.
The RM's work can then just be to skim the commits landing, mark candidates
for 0.10.1, and keep cherry-picking to a 0.10.1 feature branch?

Thanks
Vinoth

On Mon, Dec 20, 2021 at 8:42 AM Sivabalan  wrote:

> Hey folks,
>   As agreed in previous email thread, we will work towards 0.10.1 minor
> release with bug fixes from 0.10.0. I have started a jira
>  to track all bugs that
> need to go into 0.10.1. Please feel free to add comments with any
> patches you think we might want to get into 0.10.1. Please use informed
> judgment, as we are looking at only bug fixes and not new features. The RM
> will take the final decision on what goes into 0.10.1 (only those that
> are deemed a bug fix). But I wanted to start tracking so that once we
> start the release process, it would be easier as we would have
> shortlisted patches already.
>
>
> Thanks for your cooperation. Let me know if you have any
> questions/clarifications.
>
> --
> Regards,
> -Sivabalan
>


PSA, PR merges slowed down due to flaky CI

2021-12-16 Thread Vinoth Chandar
Hi all,

We have been fighting some flakiness in the CI. I think we were able to
resolve the IT tests continuously failing in Azure due to memory issues.
There are still some residual issues.

We would like to take some time to resolve these before we push on with the
PR backlog we have. Appreciate your patience.

Folks working on this: if you could keep this thread updated for
everyone, that would be great!

Thanks
Vinoth


Re: Regular minor/patch releases

2021-12-15 Thread Vinoth Chandar
Hi all,

Thanks for chiming in with the feedback. Looks like there is broad support
for this.

Responding to a few of the views below.

>With the rush in features without enough tests, I'm afraid the major
release version is never ready for production,
While I agree with you, I don't want to be very idealistic here either. 0.10,
for e.g., had a lot of testing on RCs and bug fixes afterwards as well. And some
of the features were hardened at places at Uber before we released, but
open source major releases are generally rough (you can even see how rough
newer Spark versions are for e.g), and the community puts in the effort to
make it more and more stable going forward. Hudi's problem IMO has been
that we have done only major releases from 0.6 to 0.10 (given our resource
crunch during the pandemic times). Now is a good time to revisit this.

>when fixing bugs against the master branch, the contributors/committers
should also open a new PR
We can try this and encourage this always. I am just worried that this adds
more burden on contributors and things may get missed. IMO we can pick two
RMs at any time. One for the next major release and one for the next minor
release and have them shepherd the bug fixes through? We mark JIRAs with
two fix versions.

>And for minor releases, there should only include the bug fixes, no
breaking change, no feature, it should not be a hard work i think.
+100 on this. Otherwise it defeats the purpose of the minor release.

Thanks
Vinoth

On Wed, Dec 15, 2021 at 7:22 AM leesf  wrote:

> +1
>
> We could create new branches such as release-0.10 as the master branch for
> 0.10.0, 0.10.1, etc. version releases, and when fixing bugs against the
> master branch, the contributors/committers should also open a new PR
> against the release-0.10 branch if needed. That would avoid cherry-picking
> all bug fixes from master to release-0.10 at one time and cause so many
> conflicts. You would see the Spark[1] and Flink[2] community also
> maintaining a multi-master branch as well.
>
> [1] https://github.com/apache/spark/tree/branch-3.1
> https://github.com/apache/spark/tree/branch-3.2
> [2] https://github.com/apache/flink/tree/release-1.12
> https://github.com/apache/flink/tree/release-1.13
>
> vino yang wrote on Wed, Dec 15, 2021 at 18:12:
>
> > +1
> >
> > Agree that minor release mostly for bug fix purpose.
> >
> > Best,
> > Vino
> >
> > Danny Chan wrote on Wed, Dec 15, 2021 at 10:35:
> >
> > > I guess we must do that given the current rapid development and
> > > iteration. As for the release 0.10.0, within only a few days of the
> > > announcement we have received a bunch of bugs reported through the
> > > github issues, such as:
> > >
> > > - the empty meta file: https://github.com/apache/hudi/issues/4249
> > > - and the timeline based marker files:
> > > https://github.com/apache/hudi/issues/4230
> > >
> > > With the rush in features without enough tests, I'm afraid the major
> > > release version is never ready for production, unless there is
> > > production validation like Uber has internally.
> > >
> > > And minor releases should only include bug fixes, no breaking changes,
> > > no features; it should not be hard work, I think.
> > >
> > > Best,
> > > Danny
> > >
> > > Sivabalan wrote on Tue, Dec 14, 2021 at 4:06 AM:
> > >
> > > > +1 in general. but yeah, not sure if we have resources to do this for
> > > every
> > > > major release.
> > > >
> > > > On Mon, Dec 13, 2021 at 10:01 AM Vinoth Chandar 
> > > wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > In the past we had plans for minor releases [1], but invariably we
> > end
> > > up
> > > > > doing major ones, which also deliver the bug fixes.
> > > > >
> > > > > The reason was the cost involved in doing a release. We have made
> > some
> > > > good
> > > > > progress towards regression/integration test, which prompts me to
> > > revive
> > > > > this.
> > > > >
> > > > > What does everyone think about a monthly bugfix release on the last
> > > > > major/minor version. (not on every major release, we still don't
> have
> > > > > enough contributors to pull that off IMO). So we would be trying to
> > do
> > > a
> > > > > 0.10.1 early jan for e.g, in this model?
> > > > >
> > > > > [1]
> > > https://cwiki.apache.org/confluence/display/HUDI/Release+Management
> > > > >
> > > > > Thanks
> > > > > Vinoth
> > > > >
> > > >
> > > >
> > > > --
> > > > Regards,
> > > > -Sivabalan
> > > >
> > >
> >
>


Regular minor/patch releases

2021-12-13 Thread Vinoth Chandar
Hi all,

In the past we had plans for minor releases [1], but invariably we end up
doing major ones, which also deliver the bug fixes.

The reason was the cost involved in doing a release. We have made some good
progress towards regression/integration test, which prompts me to revive
this.

What does everyone think about a monthly bugfix release on top of the last
major/minor version? (Not on every major release; we still don't have
enough contributors to pull that off, IMO.) So we would be trying to do a
0.10.1 in early Jan, for e.g., in this model?

[1] https://cwiki.apache.org/confluence/display/HUDI/Release+Management

Thanks
Vinoth


Re: [DISCUSS] Propose Consistent Hashing Indexing for Dynamic Bucket Number

2021-12-13 Thread Vinoth Chandar
+1 on the overall idea.

I am wondering if we can layer this on top of Hash Index as a way for just
expanding the number of buckets.

While Split/Merge sounds great, IMO there is significant operational
overhead to it. Most practical scenarios can be met with the ability to
expand with zero impact, as you describe?
In fact, back when I worked on voldemort (linkedin's dynamo impl), we never
shrank the tables, for this very reason.
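
To make the "key -> hash -> range -> bucket" idea in the proposal below
concrete, a toy sketch (illustrative only; the real design belongs in the
RFC):

  case class Bucket(id: Int, lo: Int, hi: Int) // owns hash range [lo, hi)

  def bucketFor(key: String, buckets: Seq[Bucket]): Bucket = {
    val h = (key.hashCode & Int.MaxValue) % (1 << 20) // hash into a fixed space
    buckets.find(b => h >= b.lo && h < b.hi).get
  }

  // splitting an oversized bucket rewrites only its own range;
  // every other bucket (file group) is untouched
  def split(b: Bucket, newId: Int): Seq[Bucket] = {
    val mid = b.lo + (b.hi - b.lo) / 2
    Seq(Bucket(b.id, b.lo, mid), Bucket(newId, mid, b.hi))
  }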

In any case, looking forward to the RFC. Please grab an RFC number!

On Mon, Dec 13, 2021 at 6:24 AM Gary Li  wrote:

> +1, looking forward to the RFC.
>
> Best,
> Gary
>
> On Sun, Dec 12, 2021 at 7:12 PM leesf  wrote:
>
> > +1 for the improvement to make bucket index more comprehensive, and
> > looking forward to the RFC for more details.
> >
> > Yuwei Xiao wrote on Fri, Dec 10, 2021 at 16:22:
> >
> > > Dear Hudi Community,
> > >
> > > I would like to propose Consistent Hashing Indexing to enable a dynamic
> > > bucket number, saving hyper-parameter tuning for Hudi users.
> > >
> > > Currently, we have Bucket Index landing [1]. It is an effective index
> > > approach to address the performance issue during Upsert. I observed ~3x
> > > throughput improvement for Upsert in my local setup compared to the
> > > Bloom Filter approach. However, it requires pre-configuring a bucket
> > > number when creating the table. As described in [1], this imposes two
> > > limitations:
> > >
> > > - Due to the one-to-one mapping between buckets and file groups, the
> > > size of a single file group may grow infinitely. Services like
> > > compaction will take longer because of the larger read/write
> > > amplification.
> > >
> > > - There may be data skew because of imbalanced data distribution,
> > > resulting in long-tail reads/writes.
> > >
> > > Based on the above observations, supporting a dynamic bucket number is
> > > necessary, especially for rapidly changing hudi tables. Looking at the
> > > market, Consistent Hashing has been adopted in DB systems [2][3]. The
> > > main idea is to turn the "key->bucket" mapping into
> > > "key->hash_value->(range mapping)->bucket", constraining the re-hashing
> > > process to touch only a few local buckets (e.g., only large file
> > > groups) rather than shuffling the whole hash table.
> > >
> > > In order to introduce Consistent Hashing to Hudi, we need to consider
> > > the following issues:
> > >
> > > - Storing hashing metadata, such as range mapping info. Metadata size
> > > and concurrent updates to metadata should also be considered.
> > >
> > > - Splitting & Merging criteria. We need to design one (or several)
> > > policies to manage 'when and how to split & merge a bucket'. A simple
> > > policy would be splitting in the middle when the file group reaches
> > > the size threshold.
> > >
> > > - Supporting concurrent write & read. The splitting or merging must not
> > > block concurrent writers & readers, and the whole process should be fast
> > > enough (e.g., one bucket at a time) to minimize the impact on other
> > > operations.
> > >
> > > - Integrating the splitting & merging process into existing hudi table
> > > service pipelines.
> > >
> > > I have sketched a prototype design to address the above problems:
> > >
> > > - Maintain hashing metadata for each partition (persisted as files),
> > > and use instants to manage multi-versioning and concurrent updates of it.
> > >
> > > - A flexible framework will be implemented for different pluggable
> > > policies. The splitting plan, specifying which bucket to split
> > > (merge) and how, will be generated during scheduling (just like how
> > > compaction does it).
> > >
> > > - Dual-write will be activated once the writer observes the splitting
> > > (or merging) process, upserting records as log files into both old and
> > > new buckets (file groups). Readers can see records once the writer
> > > completes, regardless of the splitting process.
> > >
> > > - The splitting & merging could be integrated as a sub-task into the
> > > Clustering service, because we could view them as a special case of the
> > > Clustering's goal (i.e., managing file groups based on file size).
> > > Though we need to modify Clustering to handle log files, the bucket
> > > index enhances Clustering by allowing concurrent updates.
> > >
> > >
> > > Would love to hear your thoughts and any feedback about the proposal.
> > > I can draft an RFC with a detailed design once we reach an agreement.
> > >
> > > [1]
> > >
> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index
> > >
> > > [2] YugabyteDB
> > >
> > >
> >
> https://docs.yugabyte.com/latest/architecture/docdb-sharding/sharding/#example
> > >
> > > [3] PolarDB-X
> > > https://help.aliyun.com/document_detail/316603.html#title-y5n-2i1-5ws
> > >
> > >
> > >
> > > Best,
> > >
> > > Yuwei Xiao
> > >
> >
>


Re: [VOTE] Release 0.10.0, release candidate #3

2021-12-04 Thread Vinoth Chandar
+1 (binding)

Ran the RC checks in [1]. This is a huge release, thanks everyone for all
the hard work!

[1] https://gist.github.com/vinothchandar/68b34f3051e41752ebffd6a3edeb042b

On Sat, Dec 4, 2021 at 5:20 AM Danny Chan  wrote:

> Hi everyone,
>
> Please review and vote on the release candidate #3 for the version 0.10.0,
> as follows:
>
> [ ] +1, Approve the release
>
> [ ] -1, Do not approve the release (please provide specific comments)
>
> The complete staging area is available for your review, which includes:
>
> * JIRA release notes [1],
>
> * the official Apache source release and binary convenience releases to be
> deployed to dist.apache.org [2], which are signed with the key with
> fingerprint 9A48922F682AB05D1AE4A3E7C2931E4BDB03D5AE [3],
>
> * all artifacts to be deployed to the Maven Central Repository [4],
>
> * source code tag "release-0.10.0-rc3" [5],
>
> The vote will be open for at least 72 hours. It is adopted by majority
> approval, with at least 3 PMC affirmative votes.
>
> Thanks,
>
> Release Manager
>
> [1]
>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822&version=12350285
>
> [2] https://dist.apache.org/repos/dist/dev/hudi/hudi-0.10.0-rc3/
>
> [3] https://dist.apache.org/repos/dist/dev/hudi/KEYS
>
> [4]
>
> https://repository.apache.org/content/repositories/orgapachehudi-1048/org/apache/hudi/
>
> [5] https://github.com/apache/hudi/tree/release-0.10.0-rc3
>


Re: [DISCUSS] Move to Spark DataSource V2 API

2021-11-21 Thread Vinoth Chandar
Hi all,

Sorry. Bit late to the party here. +1 on kicking this off and +1 on reusing
the work Raymond has already kickstarted here.
I think we are in a good position to roll with this approach.

The biggest issue with V2 on the writing side remains the fact that we
cannot really "shuffle" data for precombine or indexing after
we issue spark.write.format(..).

I propose the following approach

- Instead of changing the existing V1 datasource, we make a new "*hudi-v2*"
datasource
- Writes in "hudi-v2" just supports bulk_insert or overwrites like what the
spark.parquet.write does.
- We introduce a new `SparkDataSetWriteClient` or equivalent, that offer
programmatic access for all the rich APIs we have currently so users can
use it for delete, insert, update, index, etc...

I think this approach is really flexible. Users can use v1 for streaming
updates if they prefer that. It also offers a migration path from v1 to v2
that does not involve breaking the pipelines.

I see that there is an RFC open. Will get on it once my release blockers are
in better shape

Thanks
Vinoth




On Mon, Nov 15, 2021 at 6:26 AM leesf  wrote:

> Thanks Raymond for sharing the work that has been done. I agree that the 1st
> approach would need more work and time to make it totally adapted to V2
> interfaces and would differ across engines. Thus, to abstract
> the Hudi core writing/reading framework to adapt to different engines
> (approach 2 ) looks good to me at the moment since we do not need extra
> work to adapt to other engines and focus on spark writing/reading side.
>
> Raymond Xu wrote on Sunday, 2021-11-14 at 5:44 PM:
>
> > Great initiative and idea, Leesf.
> >
> > Totally agreed on the benefits of adopting V2 APIs. On the 4th point
> "Total
> > use V2 writing interface"
> >
> > I have previously worked on implementing upsert with V2 writing interface
> > with SimpleIndex using broadcast join. The POC worked without fully
> > integrating with other table services. The downside of going this route
> > would be re-implementing most of the logic we have today with the RDD
> > writer path, including different indexing implementations, which are
> > non-trivial.
> >
> > Another route I've PoC'ed is to treat the current RDD writer path as Hudi
> > "writer framework": input Dataset going through different components
> > as we see today Client -> Specific ActionExecutor -> Helper ->
> > (dedup/indexing/tagging/build profile) -> Base Write ActionExecutor ->
> (map
> > partitions and perform write on Row iterator via parquet writer/reader)
> ->
> > return Dataset
> >
> > As you can see, the 1st approach is to adopt an engine-native framework
> (V2
> > writing interface in this case) to realize Hudi operations while the 2nd
> > approach is to adopt the Hudi "writer framework" by using engine-native
> > data-level APIs to realize Hudi operations. The 2nd approach gives better
> > flexibility in adopting different engines; it leverages engines'
> > capabilities to manipulate data while ensuring write operations
> > were realized in the "Hudi" way. The prerequisite to this is to have a
> > flexible Hudi abstraction on top of different engines' data-level APIs.
> > Ethan has landed 2 major abstraction PRs to pave the way for it, which
> will
> > enable a great deal of code-reuse.
> >
> > The Hudi "writer framework" today consists of a bunch of Java classes. It
> > can be formalized and refactored along the way while implementing Row
> > writing. Once the "framework" is formalized, its flexibility can really
> > shine on bringing in new processing engines to Hudi. Something similar
> > could be done on the reader path too I suppose.
> >
> > On Tue, Nov 9, 2021 at 7:55 AM leesf  wrote:
> >
> > > Hi all,
> > >
> > > I did see the community discuss moving to V2 datasource API before [1]
> > but
> > > it got no more progress. So I want to bring up the discussion again to
> move
> > to
> > > spark datasource V2 api, Hudi still uses V1 api and relies heavily on
> RDD
> > > api to index, repartition and so on given the flexibility of RDD API.
> > > However V2 api eliminates RDD usage and introduces CatalogPlugin
> > mechanism
> > > to give the ability to manage Hudi tables and totally new writing and
> > > reading interfaces, so it causes some challenges since Hudi uses the RDD
> > in
> > > both the writing and reading paths. However, I think it is still necessary to
> > > integrate Hudi with the V2 api, as the V1 api is too old and the benefits
> from
> > > V2 api optimizations, such as more pushdown filters on the query side,
> would
> > > accelerate query speed when integrating with RFC-27 [2].
> > >
> > > And here is work I think we should do when moving to V2 api.
> > >
> > > 1. Integrate with V2 writing interface (bulk_insert row path already
> > > implemented, but not for upsert/insert operations, which would fall back to V1
> > > writing code path)
> > > 2. Integrate with V2 reading interface
> > > 3. Introducing CatalogPlugin to manage Hudi tables
> > > 4. Total use V2 writing interface
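
As a reference point, a minimal sketch of what hooking into the V2 writing
interface looks like at the SPI level: a TableProvider returning a Table that
declares BATCH_WRITE capability, roughly the shape a "hudi-v2" source could
take under this proposal. Class names here are assumptions for illustration,
not Hudi's actual implementation.

import java.util.Collections;
import java.util.Map;
import java.util.Set;

import org.apache.spark.sql.connector.catalog.SupportsWrite;
import org.apache.spark.sql.connector.catalog.Table;
import org.apache.spark.sql.connector.catalog.TableCapability;
import org.apache.spark.sql.connector.catalog.TableProvider;
import org.apache.spark.sql.connector.expressions.Transform;
import org.apache.spark.sql.connector.write.LogicalWriteInfo;
import org.apache.spark.sql.connector.write.WriteBuilder;
import org.apache.spark.sql.types.StructType;
import org.apache.spark.sql.util.CaseInsensitiveStringMap;

public class HudiV2TableProvider implements TableProvider {

  @Override
  public StructType inferSchema(CaseInsensitiveStringMap options) {
    // A real connector would resolve this from hoodie.properties / commit metadata.
    throw new UnsupportedOperationException("schema resolution is table-specific");
  }

  @Override
  public Table getTable(StructType schema, Transform[] partitioning,
                        Map<String, String> properties) {
    return new HudiV2Table(schema);
  }

  static class HudiV2Table implements Table, SupportsWrite {
    private final StructType schema;

    HudiV2Table(StructType schema) {
      this.schema = schema;
    }

    @Override
    public String name() {
      return "hudi-v2-table"; // illustrative name
    }

    @Override
    public StructType schema() {
      return schema;
    }

    @Override
    public Set<TableCapability> capabilities() {
      // Per the proposal above: plain batch writes (bulk_insert / overwrite)
      // only; upserts that need a shuffle would go through a richer client.
      return Collections.singleton(TableCapability.BATCH_WRITE);
    }

    @Override
    public WriteBuilder newWriteBuilder(LogicalWriteInfo info) {
      throw new UnsupportedOperationException(
          "sketch only: return a builder whose BatchWrite does the row-level bulk insert");
    }
  }
}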

Re: [DISCUSS] Hudi 0.10.0 Release

2021-11-19 Thread Vinoth Chandar
Hi Danny,

I have one blocker. I plan to complete it by end of next week. I am good
with the prior Nov 26 cutoff.
Does that work for everyone?

Thanks
Vinoth

On Fri, Nov 19, 2021 at 12:12 AM Danny Chan  wrote:

> Hi Community,
>
> As we draw close to doing Hudi 0.10.0 release, I am happy to share a
> summary of the key features/improvements that would be going in the release
> and the current blockers for everyone's visibility.
>
> *Highlights*
>
>- [HUDI-1290] Implement Debezium avro source for Delta Streamer
>- [HUDI-1491] Support partition pruning for MOR snapshot query
>- [HUDI-1763] DefaultHoodieRecordPayload does not honor ordering value
>when records within multiple log files are merged
>- [HUDI-1827] Add ORC support in Bootstrap Op
>- [HUDI-1869] Upgrading Spark3 To 3.1
>- [HUDI-2101] support z-order for hudi
>- [HUDI-2276] Enable Metadata Table by default for both writers and
>readers
>- [HUDI-2581] Analyze metadata size estimate in hudi with Hfile for col
>stats partition
>- [HUDI-2634] Improve bootstrap performance for very large tables
>- [HUDI-2086] redo the logical of mor_incremental_view for hive
>- [HUDI-2191] Bump flink version to 1.13.1
>- [HUDI-2285] Metadata Table Synchronous Design
>- [HUDI-2316] Support Flink batch upsert
>- [HUDI-2371] Improve flink streaming reader
>- [HUDI-2394] [Kafka Connect Mileston 1] Implement kafka connect for
>immutable data
>- [HUDI-2449] Incremental read for Flink
>- [HUDI-2562] Embedded timeline server on JobManager
>
> *Current Blockers*
>
>- [HUDI-1856] Upstream changes made in PrestoDB to eliminate file
>listing to Trino (Owner: Sagar Sumit)
>- [HUDI-1912] Presto defaults to GenericHiveRecordCursor for all Hudi
>tables (Owner: Sagar Sumit)
>- [HUDI-1932] Hive Sync should not always update last_commit_time_sync
>(Owner: Raymond Xu)
>- [HUDI-1937] When clustering fail, generating unfinished replacecommit
>timeline. (Owner: Sagar Sumit)
>- [HUDI-2077] Flaky test: TestHoodieDeltaStreamer (Owner: Sagar Sumit)
>- [HUDI-2314] Add DynamoDb based lock provider (Owner: Wenning Ding)
>- [HUDI-2325] Implement and test Hive Sync support for Kafka Connect
>(Owner: Rajesh Mahindra)
>- [HUDI-2332] Implement scheduling of compaction/ clustering for Kafka
>Connect (Owner: Ethan Guo)
>- [HUDI-2362] Hudi external configuration file support (Owner: Wenning
>Ding)
>- [HUDI-2409] Using HBase shaded jars in Hudi presto bundle (Owner:
>Sagar Sumit)
>- [HUDI-2443] KVComparator in HFile for metadata table is tied to HBase
>version and shading (Owner: Sagar Sumit)
>- [HUDI-2472] Tests failure follow up when metadata is enabled by
>default (Owner: Manoj Govindassamy)
>- [HUDI-2475] Rolling Upgrade downgrade story for 0.10 & enabling
>metadata (Owner: Manoj Govindassamy)
>- [HUDI-2478] Handle failure mid-way during init buckets (Owner: Vinoth
>Chandar)
>- [HUDI-2480] FileSlice after pending compaction-requested instant-time
>is ignored by MOR snapshot reader (Owner: Danny Chen)
>- [HUDI-2488] Support bootstrapping a single or more partitions in
>metadata table while regular writers and table services are in progress
>(Owner: Vinoth Chandar)
>- [HUDI-2527] Flaky test:
>
>  TestHoodieClientMultiWriter.testMultiWriterWithAsyncTableServicesWithConflict
>(Owner: sivabalan narayanan)
>- [HUDI-2559] Ensure unique timestamps are generated for commit times
>with concurrent writers (Owner: sivabalan narayanan)
>- [HUDI-2593] Virtual keys support for metadata table (Owner: Manoj
>Govindassamy)
>- [HUDI-2599] [Performance] Lower parallelism with snapshot query on COW
>tables in Presto (Owner: Sagar Sumit)
>- [HUDI-2628] Fix Chinese Docs (Owner: Kyle Weller)
>- [HUDI-2636] Make release notes discoverable (Owner: Kyle Weller)
>- [HUDI-2637] Triage all bugs around Multi-writer and certify the tested
>flows (Owner: sivabalan narayanan)
>- [HUDI-2641] One inflight commit rolling back other concurrent inflight
>commits causing them to fail (Owner: Udit Mehrotra)
>- [HUDI-2649] Kick off all the Hive query issues for 0.10.0 (Owner:
>Sagar Sumit)
>- [HUDI-2666] async compaction failing with timeline mismatches between
>server and client when metadata is enabled (Owner: Manoj Govindassamy)
>- [HUDI-2667] Avoid fs.exists() and fs.mkdirs() call to partitions in
>AbstractTablefileSystemView (Owner: Sagar Sumit)
>- [HUDI-2671] Fix record offset handling in Kafka connect transaction
>participant (Owner: Rajesh Mahindra)
>- [HUDI-2672] Avoid empty 

Re: [DISCUSS] RFC for Synchronous Metadata table for File listing

2021-11-12 Thread Vinoth Chandar
+1 on this.

On Fri, Nov 5, 2021 at 9:17 AM Sivabalan  wrote:

> RFC-15
> <
> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+15%3A+HUDI+File+Listing+Improvements
> >
> made an attempt to boost performance of file listing by storing all file
> information in metadata table. As we are looking to build more infra around
> metadata table (RFC-27 for data skipping, etc), we felt having a
> synchronous design will make it tighter and will avoid some of the
> corner cases of the async approach.
>
> So, we will write up a new RFC for file listing based on metadata table
> with synchronous updates.
>
> --
> Regards,
> -Sivabalan
>


Re: [DISCUSS] Hudi Community Communication Updates

2021-11-10 Thread Vinoth Chandar
+1 for this. We will also archive all community activity on ASF
infrastructure this way!

On Wed, Nov 10, 2021 at 7:14 AM Pratyaksh Sharma 
wrote:

> Hi Rajesh,
>
> I do not have any strong opinions for/against point #1.
>
> Point #2 definitely seems useful to me.
> I hope messages from the #general channel will be formatted as proper
> threads in either case - whether the thread started on the same day or a
> reply comes in on an ongoing thread.
>
> On Tue, Nov 9, 2021 at 10:48 PM Rajesh Mahindra 
> wrote:
>
> > Hi Folks,
> >
> > We are thinking of improving community communication, and here is a list
> of
> > things we propose to do (some of them are already available).
> >
> > https://cwiki.apache.org/confluence/display/HUDI/Community+Communication
> >
> > While we have a few things proposed, we plan to prioritize the following
> > two features to hopefully ease communication. We plan to extend the
> current
> > hudi bot (slack app) to implement the features.
> >
> >
> >1. As you are aware, we have 2 mailing lists: *dev* and *users*. To
> make
> >it easier for users to get these updates on a single platform (slack),
> > we
> >propose to enhance hudi-bot to mirror the emails from the 2 mailing
> > lists
> >and post them into respective slack channels (*#dev* and *#users*).
> The
> >sync will happen in near-real-time (a few mins). To ease readability,
> an
> >email thread will be synced as a single slack thread, i.e., the
> replies
> > on
> >an email will be synced as replies on the slack channel to the
> original
> >post.
> >2. It is often hard to catch up on historical slack messages. To ease
> >catching up on slack messages (*#general*) in the past few days, we
> will
> >further extend the functionality of the hudi bot to read all the slack
> >messages of a day, format them, and send them as a single email digest
> > to
> >the *dev* mailing list.
> >
> >
> >
> > Please let us know what you think of this proposal, and feel free to
> raise
> > concerns if any.
> >
> > Thanks,
> > Rajesh
> >
>


Re: [DISCUSS] RFC-27 for Data skipping/column stats index rewrite -> github RFC

2021-11-08 Thread Vinoth Chandar
+1 from me.
Please preserve the old page, since it has a bunch of discussions in the
community as well

On Fri, Nov 5, 2021 at 3:37 PM Sivabalan  wrote:

> Hey folks,
> We have already put up RFC-27 data skipping/column stats index here
> <
> https://cwiki.apache.org/confluence/display/HUDI/RFC-27+Data+skipping+index+to+improve+query+performance
> >.
> We have done more analysis on this end and are looking to add/fix more details
> to the RFC. As we have moved to github PRs for RFC process, I will rewrite
> the RFC using the new process and along the way fix the design and impl
> details. High level intent and purpose remains the same, just some
> specifics on the design and implementation to be added/fixed.
>
> --
> Regards,
> -Sivabalan
>


Re: [DISCUSS] Trino Plugin for Hudi

2021-11-05 Thread Vinoth Chandar
Could we please kick off an RFC for this?

On Thu, Nov 4, 2021 at 8:58 PM sagar sumit  wrote:

> I have created an umbrella JIRA to track this story:
> https://issues.apache.org/jira/browse/HUDI-2687
> Please also join #trino-hudi-connector channel in Hudi Slack for more
> discussion.
>
> Regards,
> Sagar
>
> On Thu, Oct 21, 2021 at 5:38 PM sagar sumit 
> wrote:
>
> > This patch supports snapshot queries on MOR table:
> > https://github.com/trinodb/trino/pull/9641
> > That works with the existing hive connector.
> >
> > Right now, I have only prototyped snapshot queries on COW table with the
> > new hudi connector in https://github.com/codope/trino/tree/hudi-plugin
> > I will be working on supporting the MOR table as well.
> >
> > Regards,
> > Sagar
> >
> > On Wed, Oct 20, 2021 at 4:48 PM Jian Feng  wrote:
> >
> >> When can Trino support snapshot queries on the Merge-on-read table?
> >>
> >> On Mon, Oct 18, 2021 at 9:06 PM 周康  wrote:
> >>
> >> > +1. I have sent a message on trino slack; really appreciate the new
> >> > trino plugin/connector.
> >> > https://trinodb.slack.com/archives/CP1MUNEUX/p1623838591370200
> >> >
> >> > looking forward to the RFC and more discussion
> >> >
> >> > On 2021/10/17 06:06:09 sagar sumit wrote:
> >> > > Dear Hudi Community,
> >> > >
> >> > > I would like to propose the development of a new Trino
> >> plugin/connector
> >> > for
> >> > > Hudi.
> >> > >
> >> > > Today, Hudi supports snapshot queries on Copy-On-Write (COW) tables
> >> and
> >> > > read-optimized queries on Merge-On-Read tables with Trino, through
> the
> >> > > input format based integration in the Hive connector [1
> >> > > ].
> >> This
> >> > > approach has known performance limitations with very large tables,
> >> which
> >> > > have since been fixed on PrestoDB [2
> >> > > ]. We are
> >> > working on
> >> > > replicating the same fixes on Trino as well [3
> >> > > ].
> >> > >
> >> > > However, as Hudi keeps getting better, a new plugin to provide
> access
> >> to
> >> > > Hudi data and metadata will help in unlocking those capabilities for
> >> the
> >> > > Trino users. Just to name a few benefits, metadata-based listing,
> full
> >> > > schema evolution, etc [4
> >> > > <
> >> >
> >>
> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+33++Hudi+supports+more+comprehensive+Schema+Evolution
> >> > >].
> >> > > Moreover, a separate Hudi connector would allow its independent
> >> evolution
> >> > > without having to worry about hacking/breaking the Hive connector.
> >> > >
> >> > > A separate connector also falls in line with our vision [5
> >> > > <
> >> >
> >>
> https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform#timeline-metaserver
> >> > >]
> >> > > when we think of a standalone timeline server or a lake cache to
> >> balance
> >> > > the tradeoff between writing and querying. Imagine users having read
> >> and
> >> > > write access to data and metadata in Hudi directly through Trino.
> >> > >
> >> > > I did some prototyping to get the snapshot queries on a Hudi COW
> table
> >> > > working with a new plugin [6
> >> > > ], and I feel the
> >> > effort
> >> > > is worth it. High-level approach is to implement the connector SPI
> [7
> >> > > ] provided
> by
> >> > Trino
> >> > > such as:
> >> > > a) HudiMetadata implements ConnectorMetadata to fetch table
> metadata.
> >> > > b) HudiSplit and HudiSplitManager implement ConnectorSplit and
> >> > > ConnectorSplitManager to produce logical units of data partitioning,
> >> so
> >> > > that Trino can parallelize reads and writes.
> >> > >
> >> > > Let me know your thoughts on the proposal. I can draft an RFC for
> the
> >> > > detailed design discussion once we have consensus.
> >> > >
> >> > > Regards,
> >> > > Sagar
> >> > >
> >> > > References:
> >> > > [1] https://github.com/prestodb/presto/commits?author=vinothchandar
> >> > > [2] https://prestodb.io/blog/2020/08/04/prestodb-and-hudi
> >> > > [3] https://github.com/trinodb/trino/pull/9641
> >> > > [4]
> >> > >
> >> >
> >>
> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+33++Hudi+supports+more+comprehensive+Schema+Evolution
> >> > > [5]
> >> > >
> >> >
> >>
> https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform#timeline-metaserver
> >> > > [6] https://github.com/codope/trino/tree/hudi-plugin
> >> > > [7] https://trino.io/docs/current/develop/connectors.html
> >> > >
> >> >
> >>
> >>
> >> --
> >> *Jian Feng,冯健*
> >> Shopee | Engineer | Data Infrastructure
> >>
> >
>
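
To make point (b) of the proposal concrete, here is a hedged sketch of a
split against the Trino SPI of that era. Field names are illustrative; a
production split would also carry @JsonCreator/@JsonProperty annotations so
Trino can serialize it to workers, and a ConnectorSplitManager would emit one
such split per file slice.

import java.util.List;

import io.trino.spi.HostAddress;
import io.trino.spi.connector.ConnectorSplit;

public class HudiSplit implements ConnectorSplit {

  private final String dataFilePath;   // base file of one Hudi file slice
  private final List<String> logFiles; // delta log files (empty for COW)

  public HudiSplit(String dataFilePath, List<String> logFiles) {
    this.dataFilePath = dataFilePath;
    this.logFiles = logFiles;
  }

  @Override
  public boolean isRemotelyAccessible() {
    return true; // files live on shared storage; any worker can read them
  }

  @Override
  public List<HostAddress> getAddresses() {
    return List.of(); // no locality preference
  }

  @Override
  public Object getInfo() {
    return dataFilePath;
  }

  public String getDataFilePath() {
    return dataFilePath;
  }

  public List<String> getLogFiles() {
    return logFiles;
  }
}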


Re: Limitations of non unique keys

2021-11-05 Thread Vinoth Chandar
Hi Siva,

I think this is more about bloom filters and record level index, which is
different from RFC-27.

RFC-08 talks about record level indexing. Bloom filter indexes have a
discuss thread just kicked off.

Main thing we are trying to solidify in 0.10.0 is the foundational
metadata table and concurrency mechanisms, to be able to add an index in the
background, say.

Thanks
Vinoth

On Fri, Nov 5, 2021 at 8:47 AM Sivabalan  wrote:

> Thanks for bringing this up. We have a RFC-27 on data skipping
> <
> https://cwiki.apache.org/confluence/display/HUDI/RFC-27+Data+skipping+index+to+improve+query+performance
> >
> which is the secondary indexing being discussed here. We are fleshing out
> a few more details on this end and will put up patches once we figure out
> the unknowns. We have a WIP patch here
> <https://github.com/apache/hudi/pull/3475>, but it needs some refactoring and
> updates before it's ready for review.
> And we are also thinking of moving the existing bloom filters (from data
> files) into metadata table and re-use them instead of reading from all data
> files with the expectation to boost performance for index lookup. We will
> start a discussion thread around this and go from there.
>
>
>
> On Wed, Nov 3, 2021 at 5:36 PM Nicolas Paris 
> wrote:
>
> >
> > > In other words, we are generalizing this so hudi feels more like
> > > MySQL and not HBase/Cassandra (key value store). That's the direction
> > > we are approaching.
> >
> > wow this is amazing. I haven't found yet RFC about this, nor ready to
> > test PR.
> >
> > This answers my initial question: with the secondary index options
> > coming, the hudi key shall be a primary key (if one exists). There is no
> > reason to choose anything else.
> >
> > On Wed Nov 3, 2021 at 9:03 PM CET, Vinoth Chandar wrote:
> > > Hi.
> > >
> > > With the indexing approach we are taking, you should be able to add
> > > secondary indexes on any column, not just the key.
> > > In other words, we are generalizing this so hudi feels more like
> MySQL
> > > and not HBase/Cassandra (key value store). That's the direction we are
> > > approaching.
> > >
> > > love to hear more feedback.
> > >
> > > On Tue, Nov 2, 2021 at 2:29 AM Nicolas Paris  >
> > > wrote:
> > >
> > > > for example, does the move of blooms into hfiles (0.10.0 feature)
> make
> > > > unique bloom keys mandatory?
> > > >
> > > >
> > > >
> > > > On Thu Oct 28, 2021 at 7:00 PM CEST, Nicolas Paris wrote:
> > > > >
> > > > > > Are you asking if there are advantages to allowing duplicates or
> > not
> > > > having keys in your table?
> > > > > it's all about allowing duplicates
> > > > >
> > > > > the use case is, say, an Order table with key = customer_id,
> > > > > then being able to do an indexed delete without needing to pre-scan
> the
> > > > > dataset
> > > > >
> > > > > I wonder if there will be trouble I am unaware of with such a trick
> > > > >
> > > > > On Thu Oct 28, 2021 at 2:33 PM CEST, Vinoth Chandar wrote:
> > > > > > Hi,
> > > > > >
> > > > > > Are you asking if there are advantages to allowing duplicates or
> > not
> > > > > > having
> > > > > > keys in your table?
> > > > > >
> > > > > > Having keys helps with other practical scenarios, in addition to
> > what
> > > > > > you
> > > > > > called out.
> > > > > > e.g: Oftentimes, you would want to backfill an insert-only table
> > and
> > > > you
> > > > > > don't want to introduce duplicates when doing so.
> > > > > >
> > > > > > Thanks
> > > > > > Vinoth
> > > > > >
> > > > > > On Tue, Oct 26, 2021 at 1:37 AM Nicolas Paris <
> > > > nicolas.pa...@riseup.net>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi devs,
> > > > > > >
> > > > > > > AFAIK, hudi has been designed to have primary keys in the
> hudi
> > key.
> > > > > > > However it is possible to also choose a non unique field. I
> have
> > > > listed
> > > > > > > several troubles with such a design:
> > > > > > >
> > > > > > > A non-unique key leads to:
> > > > > > > - cannot delete / update a unique record
> > > > > > > - cannot apply primary key for new sql tables feature
> > > > > > >
> > > > > > > Are there other downsides to choosing a non-unique key that you
> > > > > > > have in mind?
> > > > > > >
> > > > > > > In my case, having user_id as a hudi key will help to apply
> > deletion
> > > > on
> > > > > > > the user level in any user table. The tables are insert-only, so
> > the
> > > > > > > drawbacks listed above do not really apply. In case of error in
> > the
> > > > > > > tables I have several options:
> > > > > > >
> > > > > > > - rollback to a previous commit
> > > > > > > - read partition/filter overwrite partition
> > > > > > >
> > > > > > > Thanks
> > > > > > >
> > > >
> > > >
> >
> >
>
> --
> Regards,
> -Sivabalan
>


Re: [DISCUSS] Metadata based bloom index

2021-11-05 Thread Vinoth Chandar
+1 on this. I think cloud storage throttling is more of an issue that
causes degradations when tables are enormous,
but this approach should nicely handle that as well.

On Fri, Nov 5, 2021 at 9:31 AM Manoj Govindassamy <
manoj.govindass...@gmail.com> wrote:

> Hi Hudi Community,
>
> Hudi has several indices to help look up records. The most commonly used one
> is the BloomFilter based index. This index today works by loading the bloom
> filter from all the data files of the partitions of interest. This is a
> time-consuming operation. It would be better if we could leverage the metadata
> table infrastructure of the Hudi tables. That is, if all the bloom filters can be
> loaded directly from a single metadata table partition, it would greatly
> speed up the entire record key lookup process.
>
> Let me know your thoughts on this high level idea. Planning to start a RFC
> on this and I can share more details on the design and implementation.
>
> Regards,
> Manoj
>
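
The gain comes from pruning: one read of a bloom-filter partition in the
metadata table replaces a footer read per data file. A runnable sketch of
the lookup side, with Guava's BloomFilter standing in for Hudi's own
serialized filters (file names and sizing here are made up):

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class BloomIndexLookupSketch {

  // One metadata-table read yields all filters; prune candidate files per key.
  static List<String> candidateFiles(Map<String, BloomFilter<String>> filtersByFile,
                                     String recordKey) {
    List<String> candidates = new ArrayList<>();
    for (Map.Entry<String, BloomFilter<String>> e : filtersByFile.entrySet()) {
      if (e.getValue().mightContain(recordKey)) { // bloom filters have no false negatives
        candidates.add(e.getKey());
      }
    }
    return candidates; // only these files need an actual key scan
  }

  public static void main(String[] args) {
    BloomFilter<String> filter =
        BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 1_000, 0.01);
    filter.put("user-42");
    System.out.println(candidateFiles(Map.of("file-1.parquet", filter), "user-42"));
  }
}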


Re: Release 0.10.0 planning

2021-11-05 Thread Vinoth Chandar
Let's lock it in, unless someone objects by Monday PST.

On Fri, Nov 5, 2021 at 12:59 AM Gary Li  wrote:

> Nov 26 looks good to me.
>
> Gary
>
> On Wed, Nov 3, 2021 at 10:34 PM Sivabalan  wrote:
>
> > sounds fine by me. Can others (especially PMC and committers) chime in
> here
> > for the proposed date.
> >
> >
> >
> > On Wed, Nov 3, 2021 at 4:11 AM Vinoth Chandar  wrote:
> >
> > > Folks, may be good to push it by a week. Nov 26 can be the RC cut date.
> > >
> > > On Mon, Nov 1, 2021 at 7:41 PM Vinoth Chandar 
> wrote:
> > >
> > > >
> > > > Great! Is everyone good with the Nov 19 date? Love to at least do this
> > > > before Nov 26, before holidays kick in!
> > > >
> > > >
> > > > On Mon, Nov 1, 2021 at 7:36 PM Danny Chan 
> > wrote:
> > > >
> > > >> I can take that.
> > > >>
> > > >> Best,
> > > >> Danny
> > > >>
> > > >> Vinoth Chandar wrote on Saturday, 2021-10-30 at 6:07 AM:
> > > >>
> > > >> > Hi all,
> > > >> >
> > > >> > I propose we cut the RC for 0.10.0 by Nov 19.
> > > >> >
> > > >> > Any volunteers for release manager?
> > > >> >
> > > >> > Thanks
> > > >> > Vinoth
> > > >> >
> > > >> > On Sun, Oct 17, 2021 at 10:45 AM Sivabalan 
> > > wrote:
> > > >> >
> > > >> > > This release has a lot of exciting features lined up. Eagerly
> > > looking
> > > >> > > forward to it.
> > > >> > >
> > > >> > > On Thu, Oct 14, 2021 at 1:17 PM Vinoth Chandar <
> vin...@apache.org
> > >
> > > >> > wrote:
> > > >> > >
> > > >> > > > Hi all,
> > > >> > > >
> > > >> > > > It's time for our next release again!
> > > >> > > >
> > > >> > > > I have marked out some blockers here on JIRA.
> > > >> > > >
> > > >> > > >
> https://issues.apache.org/jira/projects/HUDI/versions/12350285
> > > >> > > >
> > > >> > > >
> > > >> > > > Quick highlights:
> > > >> > > > - Metadata table v2, which is synchronously updated
> > > >> > > > - Row writing (Spark) for all write operations
> > > >> > > > - Kafka Connect for append only data model
> > > >> > > > - New indexing schemes moving bloom filters and file range
> > footers
> > > >> into
> > > >> > > > metadata table to improve upsert/delete performance.
> > > >> > > > - Fixes needed for Trino/Presto support.
> > > >> > > > - Most of the "big-needle-mover" PRs that are up already.
> > > >> > > > - Revamp of docs to match our vision.
> > > >> > > >
> > > >> > > > May need some help understanding all the Flink related
> changes.
> > > >> > > >
> > > >> > > > Kindly review and let's use this thread to ratify and discuss
> > > >> > timelines.
> > > >> > > >
> > > >> > > >
> > > >> > > > Thanks
> > > >> > > > Vinoth
> > > >> > > >
> > > >> > >
> > > >> > >
> > > >> > > --
> > > >> > > Regards,
> > > >> > > -Sivabalan
> > > >> > >
> > > >> >
> > > >>
> > > >
> > >
> >
> >
> > --
> > Regards,
> > -Sivabalan
> >
>


Re: Limitations of non unique keys

2021-11-03 Thread Vinoth Chandar
Hi.

With the indexing approach we are taking, you should be able to add
secondary indexes on any column, not just the key.
In other words, we are generalizing this so hudi feels more like MySQL
and not HBase/Cassandra (key value store). That's the direction we are
approaching.

love to hear more feedback.

On Tue, Nov 2, 2021 at 2:29 AM Nicolas Paris 
wrote:

> for example, does the move of blooms into hfiles (0.10.0 feature) make
> unique bloom keys mandatory?
>
>
>
> On Thu Oct 28, 2021 at 7:00 PM CEST, Nicolas Paris wrote:
> >
> > > Are you asking if there are advantages to allowing duplicates or not
> having keys in your table?
> > it's all about allowing duplicates
> >
> > the use case is, say, an Order table with key = customer_id,
> > then being able to do an indexed delete without needing to pre-scan the
> > dataset
> >
> > I wonder if there will be trouble I am unaware of with such a trick
> >
> > On Thu Oct 28, 2021 at 2:33 PM CEST, Vinoth Chandar wrote:
> > > Hi,
> > >
> > > Are you asking if there are advantages to allowing duplicates or not
> > > having
> > > keys in your table?
> > >
> > > Having keys helps with other practical scenarios, in addition to what
> > > you
> > > called out.
> > > e.g: Oftentimes, you would want to backfill an insert-only table and
> you
> > > don't want to introduce duplicates when doing so.
> > >
> > > Thanks
> > > Vinoth
> > >
> > > On Tue, Oct 26, 2021 at 1:37 AM Nicolas Paris <
> nicolas.pa...@riseup.net>
> > > wrote:
> > >
> > > > Hi devs,
> > > >
> > > > AFAIK, hudi has been designed to have primary keys in the hudi key.
> > > > However it is possible to also choose a non-unique field. I have
> listed
> > > > several troubles with such a design:
> > > >
> > > > A non-unique key leads to:
> > > > - cannot delete / update a unique record
> > > > - cannot apply primary key for new sql tables feature
> > > >
> > > > Are there other downsides to choosing a non-unique key that you have in
> mind?
> > > >
> > > > In my case, having user_id as a hudi key will help to apply deletion
> on
> > > > the user level in any user table. The tables are insert-only, so the
> > > > drawbacks listed above do not really apply. In case of error in the
> > > > tables I have several options:
> > > >
> > > > - rollback to a previous commit
> > > > - read partition/filter overwrite partition
> > > >
> > > > Thanks
> > > >
>
>
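
To make "secondary indexes on any column" concrete, a toy in-memory
illustration (purely conceptual; Hudi's actual design persists such mappings
in the metadata table): a reverse mapping from a non-unique column value to
the record keys carrying it, so a delete by user_id needs no table scan.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class SecondaryIndexToy {

  // user_id -> record keys that carry it (a non-unique column).
  private final Map<String, Set<String>> userIdToRecordKeys = new HashMap<>();

  public void index(String userId, String recordKey) {
    userIdToRecordKeys.computeIfAbsent(userId, k -> new HashSet<>()).add(recordKey);
  }

  // Record keys to delete for a GDPR-style request, found without scanning.
  public Set<String> recordKeysFor(String userId) {
    return userIdToRecordKeys.getOrDefault(userId, Set.of());
  }
}

With this shape, the hudi key can stay a true primary key while deletes by
user_id are still index-assisted.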


Re: feature request/proposal: leverage bloom indexes for reading

2021-11-03 Thread Vinoth Chandar
Hi,

You are right about the datasource API. This is one of the mismatches that
prevents us from exposing this more nicely.

We are definitely going the route of having a select query take hints and
use the index for faster lookups. We could try this in 0.11 once the new
multi-modal indexing lands.

For now, you can actually use the ReadClient; it will work fine. I think we
used it internally at uber.


On Thu, Oct 28, 2021 at 10:22 AM Nicolas Paris 
wrote:

> I tested the HoodieReadClient. It's a great start indeed. Looks like
> this client is meant for testing purposes and needs some enhancement. I
> will try to produce general-purpose code around this and, who knows,
> contribute it.
>
> I guess the datasource api is not the best candidate since hudi keys
> cannot be passed as options but with rdd or df:
>
> spark.read.format('hudi').option('hudi.filter.keys',
> 'a,flat,list,of,keys,not,really,cool').load(...)
>
> there is also the option to introduce a new hudi operation such as
> "select". But again, it's not supposed to return a dataframe but to write to
> the hudi table:
>
> df_hudi_keys.options(**hudi_options).save(...)
>
> Then a fully featured / documented hoodie client is maybe the best option
>
>
> Thoughts?
>
>
> On Thu Oct 28, 2021 at 2:34 PM CEST, Vinoth Chandar wrote:
> > Sounds great!
> >
> > On Tue, Oct 26, 2021 at 7:26 AM Nicolas Paris 
> > wrote:
> >
> > > Hi Vinoth,
> > >
> > > Thanks for the starter. Definitely, once the new way to manage indexes
> > > lands and we get migrated to hudi on our datalake, I'd be glad to give
> > > this a shot.
> > >
> > >
> > > Regards, Nicolas
> > >
> > > On Fri Oct 22, 2021 at 4:33 PM CEST, Vinoth Chandar wrote:
> > > > Hi Nicolas,
> > > >
> > > > Thanks for raising this! I think it's a very valid ask.
> > > > https://issues.apache.org/jira/browse/HUDI-2601 has been raised.
> > > >
> > > > As a proof of concept, would you be able to give filterExists() a
> shot
> > > > and
> > > > see if the filtering time improves?
> > > >
> > >
> https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/HoodieReadClient.java#L172
> > > >
> > > > In the upcoming 0.10.0 release, we are planning to move the bloom
> > > > filters
> > > > out to a partition on the metadata table, to even speed this up for
> very
> > > > large tables.
> > > > https://issues.apache.org/jira/browse/HUDI-1295
> > > >
> > > > Please let us know if you are interested in testing that when the PR
> is
> > > > up.
> > > >
> > > > Thanks
> > > > Vinoth
> > > >
> > > > On Tue, Oct 19, 2021 at 4:38 AM Nicolas Paris <
> nicolas.pa...@riseup.net>
> > > > wrote:
> > > >
> > > > > hi !
> > > > >
> > > > > In my use case, for GDPR I have to export all information about a
> given
> > > > > user from several HUGE hudi tables. Filtering the table results in
> a
> > > > > full scan of around 10 hours, and this will get worse year after
> year.
> > > > >
> > > > > Since the filter criterion is based on the bloom key (user_id), it
> would
> > > > > be handy to exploit the bloom and produce a temporary table (in the
> > > > > metastore, for example) with the resulting rows.
> > > > >
> > > > > So far the bloom indexing is used for update/delete operations on a
> > > hudi
> > > > > table.
> > > > >
> > > > > 1. There is an opportunity to exploit the bloom for select
> operations.
> > > > > The hudi options would be:
> > > > > operation: select
> > > > > result-table: 
> > > > > result-path: 
> > > > > result-schema:  (optional ; when empty
> no
> > > > > sync with the hms, only raw path)
> > > > >
> > > > >
> > > > > 2. It could be implemented as predicate push down in the spark
> > > > > datasource API, when filtering with an IN statement.
> > > > >
> > > > >
> > > > > Thoughts?
> > > > >
> > >
> > >
>
>
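
A sketch of the two read paths discussed in this thread, in Java. The first
form works today but degenerates to a full scan on a large table; the
commented form uses 'hudi.filter.keys', the hypothetical option proposed
above, which does not exist as a real config. Table and output paths are
assumptions.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class GdprKeyLookup {
  public static void main(String[] args) {
    // Requires the hudi-spark bundle on the classpath.
    SparkSession spark = SparkSession.builder()
        .appName("hudi-key-lookup")
        .master("local[*]")
        .getOrCreate();

    // Works today, but reads every file group to evaluate the predicate.
    Dataset<Row> hits = spark.read().format("hudi")
        .load("/data/hudi/user_events")     // assumed table path
        .filter("user_id = 'user-42'");     // GDPR subject lookup

    // The proposal above: pass keys as an option so the source can consult the
    // bloom index and open only matching file groups (hypothetical config):
    // spark.read().format("hudi")
    //     .option("hudi.filter.keys", "user-42")
    //     .load("/data/hudi/user_events");

    hits.write().mode("overwrite").parquet("/tmp/gdpr-export/user-42");
  }
}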


Re: Monthly or Bi-Monthly Dev meeting?

2021-11-03 Thread Vinoth Chandar
Hi all,

I have shared some times and setup as discussed in this PR.
https://github.com/apache/hudi/pull/3914

Thanks
Vinoth

On Fri, Oct 22, 2021 at 10:50 PM Pratyaksh Sharma 
wrote:

> I can save them all on my external hard disk. :)
>
> On Fri, Oct 22, 2021 at 8:04 PM Vinoth Chandar  wrote:
>
> > We could, but just need storage space over the longer term. :)
> >
> > On Wed, Oct 20, 2021 at 9:56 PM Raymond Xu 
> > wrote:
> >
> > > Timing looks ok. Are we going to record the sessions too?
> > >
> > > On Wed, Oct 20, 2021 at 7:17 PM Vinoth Chandar 
> > wrote:
> > >
> > > > I think we can do 7AM PST winters and 8AM summers.
> > > > Will draft a page with a zoom link we can use and put up a PR.
> > > >
> > > >
> > > > On Thu, Oct 14, 2021 at 9:48 AM Vinoth Chandar 
> > > wrote:
> > > >
> > > > > Yes. I can do 7AM PST. Can others in PST chime in please?
> > > > >
> > > > > We can wrap this up this week.
> > > > >
> > > > > On Tue, Oct 12, 2021 at 7:25 PM Gary Li  wrote:
> > > > >
> > > > >> Hi Vinoth,
> > > > >>
> > > > >> Summertime 8 AM PST was 11 PM in China so I guess it works for
> some
> > > > folks,
> > > > >> but switching to wintertime it was 12 AM in China. It might be a
> bit
> > > > late
> > > > >> IMO. Does 3 PM UTC(7 AM PST in winter, 8 AM in summer) work?
> > > > >>
> > > > >> Best,
> > > > >> Gary
> > > > >>
> > > > >> On Tue, Oct 5, 2021 at 9:20 PM Pratyaksh Sharma <
> > > pratyaks...@gmail.com>
> > > > >> wrote:
> > > > >>
> > > > >> > Works for me in India :)
> > > > >> >
> > > > >> > On Tue, Oct 5, 2021 at 9:41 AM Vinoth Chandar <
> vin...@apache.org>
> > > > >> wrote:
> > > > >> >
> > > > >> > > Looks like there is enough interest here.
> > > > >> > >
> > > > >> > > Moving onto timing. Does 8AM PST, on the second Thursday of
> > every
> > > > >> > > month work for everyone?
> > > > >> > > This is the time I find, works best for most time zones.
> > > > >> > >
> > > > >> > > On Thu, Sep 23, 2021 at 1:15 PM Y Ethan Guo <
> > > > ethan.guoyi...@gmail.com
> > > > >> >
> > > > >> > > wrote:
> > > > >> > >
> > > > >> > > > +1 on monthly community sync.
> > > > >> > > >
> > > > >> > > > On Thu, Sep 23, 2021 at 12:32 PM Udit Mehrotra <
> > > udi...@apache.org
> > > > >
> > > > >> > > wrote:
> > > > >> > > >
> > > > >> > > > > +1 for the monthly meeting. It would be great to start
> > syncing
> > > > up
> > > > >> > > > > again. Thanks Vinoth for bringing it up !
> > > > >> > > > >
> > > > >> > > > > On Thu, Sep 23, 2021 at 12:14 PM Sivabalan <
> > > n.siv...@gmail.com>
> > > > >> > wrote:
> > > > >> > > > > >
> > > > >> > > > > > +1 on monthly meet up.
> > > > >> > > > > >
> > > > >> > > > > > On Thu, Sep 23, 2021 at 11:01 AM vino yang <
> > > > >> yanghua1...@gmail.com>
> > > > >> > > > > wrote:
> > > > >> > > > > >
> > > > >> > > > > > > +1 for monthly
> > > > >> > > > > > >
> > > > >> > > > > > > Best,
> > > > >> > > > > > > Vino
> > > > >> > > > > > >
> > > > >> > > > > > > Pratyaksh Sharma 
> > > > >> wrote on Thursday, 2021-09-23 at 9:36 PM:
> > > > >> > > > > > >
> > > > >> > > > > > > > Monthly should be good. Been a long time since we
> > > > connected
> > > > >> in
> > > > >> > > > these
> > > > >> > > > > > > > meetings. :)
> > > > >> > > > > > > >
> > > > >> > > > > > > > On Thu, Sep 23, 2021 at 7:02 PM Vinoth Chandar <
> > > > >> > > > > > > > mail.vinoth.chan...@gmail.com> wrote:
> > > > >> > > > > > > >
> > > > >> > > > > > > > > 1 hour monthly is what I was proposing to be
> > specific.
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > On Thu, Sep 23, 2021 at 6:30 AM Gary Li <
> > > > >> gar...@apache.org>
> > > > >> > > > wrote:
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > > +1 for monthly.
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > On Thu, Sep 23, 2021 at 8:28 PM Vinoth Chandar <
> > > > >> > > > > vin...@apache.org>
> > > > >> > > > > > > > > wrote:
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > > Hi all,
> > > > >> > > > > > > > > > >
> > > > >> > > > > > > > > > > Once upon a time, we used to have a weekly
> > > community
> > > > >> > sync.
> > > > >> > > > > > > Wondering
> > > > >> > > > > > > > if
> > > > >> > > > > > > > > > > there is interest in having a monthly or
> > > bi-monthly
> > > > >> dev
> > > > >> > > > > meeting?
> > > > >> > > > > > > > > > >
> > > > >> > > > > > > > > > > Agenda could be
> > > > >> > > > > > > > > > > - Update/Summary of all dev work tracks
> > > > >> > > > > > > > > > > - Show and tell, where people can present
> their
> > > > >> ongoing
> > > > >> > > work
> > > > >> > > > > > > > > > > - Open floor discussions, bring up new issues.
> > > > >> > > > > > > > > > >
> > > > >> > > > > > > > > > > Thanks
> > > > >> > > > > > > > > > > Vinoth
> > > > >> > > > > > > > > > >
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > >
> > > > >> > > > > > > >
> > > > >> > > > > > >
> > > > >> > > > > >
> > > > >> > > > > >
> > > > >> > > > > > --
> > > > >> > > > > > Regards,
> > > > >> > > > > > -Sivabalan
> > > > >> > > > >
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > > >
> > > >
> > >
> >
>


Re: Release 0.10.0 planning

2021-11-03 Thread Vinoth Chandar
Folks, may be good to push it by a week. Nov 26 can be the RC cut date.

On Mon, Nov 1, 2021 at 7:41 PM Vinoth Chandar  wrote:

>
> Great! Is everyone good with the Nov 19 date? Love to at least do this
> before Nov 26, before holidays kick in!
>
>
> On Mon, Nov 1, 2021 at 7:36 PM Danny Chan  wrote:
>
>> I can take that.
>>
>> Best,
>> Danny
>>
>> Vinoth Chandar wrote on Saturday, 2021-10-30 at 6:07 AM:
>>
>> > Hi all,
>> >
>> > I propose we cut the RC for 0.10.0 by Nov 19.
>> >
>> > Any volunteers for release manager?
>> >
>> > Thanks
>> > Vinoth
>> >
>> > On Sun, Oct 17, 2021 at 10:45 AM Sivabalan  wrote:
>> >
>> > > This release has a lot of exciting features lined up. Eagerly looking
>> > > forward to it.
>> > >
>> > > On Thu, Oct 14, 2021 at 1:17 PM Vinoth Chandar 
>> > wrote:
>> > >
>> > > > Hi all,
>> > > >
>> > > > It's time for our next release again!
>> > > >
>> > > > I have marked out some blockers here on JIRA.
>> > > >
>> > > > https://issues.apache.org/jira/projects/HUDI/versions/12350285
>> > > >
>> > > >
>> > > > Quick highlights:
>> > > > - Metadata table v2, which is synchronously updated
>> > > > - Row writing (Spark) for all write operations
>> > > > - Kafka Connect for append only data model
>> > > > - New indexing schemes moving bloom filters and file range footers
>> into
>> > > > metadata table to improve upsert/delete performance.
>> > > > - Fixes needed for Trino/Presto support.
>> > > > - Most of the "big-needle-mover" PRs that are up already.
>> > > > - Revamp of docs to match our vision.
>> > > >
>> > > > May need some help understanding all the Flink related changes.
>> > > >
>> > > > Kindly review and let's use this thread to ratify and discuss
>> > timelines.
>> > > >
>> > > >
>> > > > Thanks
>> > > > Vinoth
>> > > >
>> > >
>> > >
>> > > --
>> > > Regards,
>> > > -Sivabalan
>> > >
>> >
>>
>


Re: Release 0.10.0 planning

2021-11-01 Thread Vinoth Chandar
Great! Is everyone good with the Nov 19 date? Love to at least do this
before Nov 26, before holidays kick in!


On Mon, Nov 1, 2021 at 7:36 PM Danny Chan  wrote:

> I can take that.
>
> Best,
> Danny
>
> Vinoth Chandar wrote on Saturday, 2021-10-30 at 6:07 AM:
>
> > Hi all,
> >
> > I propose we cut the RC for 0.10.0 by Nov 19.
> >
> > Any volunteers for release manager?
> >
> > Thanks
> > Vinoth
> >
> > On Sun, Oct 17, 2021 at 10:45 AM Sivabalan  wrote:
> >
> > > This release has a lot of exciting features lined up. Eagerly looking
> > > forward to it.
> > >
> > > On Thu, Oct 14, 2021 at 1:17 PM Vinoth Chandar 
> > wrote:
> > >
> > > > Hi all,
> > > >
> > > > It's time for our next release again!
> > > >
> > > > I have marked out some blockers here on JIRA.
> > > >
> > > > https://issues.apache.org/jira/projects/HUDI/versions/12350285
> > > >
> > > >
> > > > Quick highlights:
> > > > - Metadata table v2, which is synchronously updated
> > > > - Row writing (Spark) for all write operations
> > > > - Kafka Connect for append only data model
> > > > - New indexing schemes moving bloom filters and file range footers
> into
> > > > metadata table to improve upsert/delete performance.
> > > > - Fixes needed for Trino/Presto support.
> > > > - Most of the "big-needle-mover" PRs that are up already.
> > > > - Revamp of docs to match our vision.
> > > >
> > > > May need some help understanding all the Flink related changes.
> > > >
> > > > Kindly review and let's use this thread to ratify and discuss
> > timelines.
> > > >
> > > >
> > > > Thanks
> > > > Vinoth
> > > >
> > >
> > >
> > > --
> > > Regards,
> > > -Sivabalan
> > >
> >
>


Re: Release 0.10.0 planning

2021-10-29 Thread Vinoth Chandar
Hi all,

I propose we cut the RC for 0.10.0 by Nov 19.

Any volunteers for release manager?

Thanks
Vinoth

On Sun, Oct 17, 2021 at 10:45 AM Sivabalan  wrote:

> This release has a lot of exciting features lined up. Eagerly looking
> forward to it.
>
> On Thu, Oct 14, 2021 at 1:17 PM Vinoth Chandar  wrote:
>
> > Hi all,
> >
> > It's time for our next release again!
> >
> > I have marked out some blockers here on JIRA.
> >
> > https://issues.apache.org/jira/projects/HUDI/versions/12350285
> >
> >
> > Quick highlights:
> > - Metadata table v2, which is synchronously updated
> > - Row writing (Spark) for all write operations
> > - Kafka Connect for append only data model
> > - New indexing schemes moving bloom filters and file range footers into
> > metadata table to improve upsert/delete performance.
> > - Fixes needed for Trino/Presto support.
> > - Most of the "big-needle-mover" PRs that are up already.
> > - Revamp of docs to match our vision.
> >
> > May need some help understanding all the Flink related changes.
> >
> > Kindly review and let's use this thread to ratify and discuss timelines.
> >
> >
> > Thanks
> > Vinoth
> >
>
>
> --
> Regards,
> -Sivabalan
>


Re: New site/docs navigation

2021-10-28 Thread Vinoth Chandar
Awesome!
I think Kyle has already fixed some issues around cn docs in the PR above.
Could you review that?
Kyle, if you are here, please chime in. We can organize all the work under
a single umbrella JIRA.
https://issues.apache.org/jira/browse/HUDI-270 so it's easier for any
volunteers to pick up?

On Thu, Oct 28, 2021 at 6:21 AM Shawy Geng  wrote:

> Hi Vinoth,
>
> Volunteering to update the Chinese doc. Already commented at
> https://issues.apache.org/jira/browse/HUDI-2628.
> Are there any other volunteers who want to work together to translate?
> Please contact me.
>
> > On 2021-10-28 at 20:35, Vinoth Chandar wrote:
> >
> > Hi all,
> >
> > https://github.com/apache/hudi/pull/3855 puts up a nice redesign of the
> > content, that can show case all of the Hudi capabilities. Please chime in
> > and help merge the PR.
> >
> > As follow on, we can also fix the Chinese site docs after this?
> >
> > Thanks
> > Vinoth
>
>


New site/docs navigation

2021-10-28 Thread Vinoth Chandar
Hi all,

https://github.com/apache/hudi/pull/3855 puts up a nice redesign of the
content, that can show case all of the Hudi capabilities. Please chime in
and help merge the PR.

As follow on, we can also fix the Chinese site docs after this?

Thanks
Vinoth

