What is the precombine field really used for, and what is its future?

2023-03-31 Thread Daniel Kaźmirski
Hi all,

I would like to bring up the topic of how the precombine field is used and
what its purpose is. I would also like to know what the plans for it are
going forward.

At first glance the precombine field looks like it's only used to deduplicate
records in the incoming batch.
But when digging deeper, it can also be used to:
1. combine records not before but on write, to decide whether to update an
existing record (e.g. with DefaultHoodieRecordPayload),
2. combine records on read for a MoR table, to merge log and base files
correctly,
3. satisfy Spark SQL UPDATE, which requires a precombine field even though
the user can't introduce duplicates with that statement anyway.

Regarding [3], there's an inconsistency, as the precombine field is not
required in MERGE INTO UPDATE. Underneath, UPSERT is switched to INSERT in
upsert mode to update existing records.
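
For reference (and to make [1] concrete), here is a minimal sketch of a Spark
datasource upsert that exercises the combine-on-write behaviour. The table,
columns and path are made up; the options are the standard Hudi write configs
as I understand them:

// Sketch (Scala, spark-shell): the precombine field "ts" plus
// DefaultHoodieRecordPayload decide at write time whether an incoming row
// replaces the stored one (the row with the larger "ts" wins).
import spark.implicits._

val updates = Seq((1, "v1", 100L), (1, "v2", 200L)).toDF("order_id", "status", "ts")

updates.write.format("hudi").
  option("hoodie.table.name", "orders").
  option("hoodie.datasource.write.recordkey.field", "order_id").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.payload.class",
    "org.apache.hudi.common.model.DefaultHoodieRecordPayload").
  option("hoodie.datasource.write.operation", "upsert").
  mode("append").
  save("/tmp/hudi/orders")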

I know that Hudi does a lot of work to ensure PK uniqueness across/within
partitions, and that there is a need to deduplicate records before write, or
to deduplicate existing data if duplicates were introduced, e.g. when using
non-strict insert mode.

What should happen, then, when the user does not want to or cannot provide a
precombine field? It would then be on the user not to introduce duplicates,
but it would make Hudi more generic and easier to use for "SQL" people.

Having no precombine field is already possible for CoW, but then UPSERT and
SQL UPDATE are not supported (users can still update records using Insert in
non-strict mode or MERGE INTO UPDATE).
There's also a difference between CoW and MoR: for MoR the precombine field is
a hard requirement, while it is optional for CoW.
(UPDATEs with no precombine are also possible in Flink for both CoW and MoR,
but not in Spark.)

Would it then make sense to take inspiration from some DBMS systems (e.g.
Synapse) and allow updates and upserts when no precombine field is specified?
Scenario:
Say duplicates were introduced with Insert in non-strict mode and no
precombine field is specified; then we have two options:
option 1) on UPDATE/UPSERT, Hudi should deduplicate the existing records; as
there's no precombine field, it's expected that we don't know which records
will be removed and which will be effectively updated and preserved in the
table. (This can also be achieved by always providing the same value in the
precombine field for all records.)
option 2) on UPDATE/UPSERT, Hudi should deduplicate the existing records; as
there's no precombine field, the record with the latest _hoodie_commit_time
is preserved and updated, and other records with the same PK are removed.

In both cases, deduplication on UPDATE/UPSERT becomes a hard rule, whether we
use a precombine field or not.
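
To make the scenario concrete, here is a Spark SQL sketch of how the
duplicates would appear (illustrative table and config names; how the final
UPDATE should then behave is exactly the open question above, not current
behaviour):

// Sketch: duplicates slip in via non-strict insert mode, no precombine field set.
spark.sql("CREATE TABLE customers (id INT, name STRING, city STRING) USING hudi TBLPROPERTIES (primaryKey = 'id')")
spark.sql("SET hoodie.sql.insert.mode = non-strict")
spark.sql("INSERT INTO customers VALUES (1, 'alice', 'warsaw')")
spark.sql("INSERT INTO customers VALUES (1, 'alice', 'krakow')")  // same PK twice

// Today this UPDATE is rejected without a precombine field; under option 1/2
// it would also deduplicate the two rows with id = 1 before applying the change.
spark.sql("UPDATE customers SET city = 'gdansk' WHERE id = 1")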

Then, regarding MoR and merging records on read (I found this in the Hudi
format spec): can it be done using only _hoodie_commit_time in the absence of
a precombine field?
If so, could the precombine field become completely optional for both MoR and
CoW?

I'm of course looking at this more from the user perspective; it would be
nice to know what is and what is not possible from the design and developer
perspective.

Best Regards,
Daniel Kaźmirski


Schema evolution strategies

2023-03-31 Thread Daniel Kaźmirski
Hi folks,

I would like to discuss the topic of schema evolution in Hudi, as I think we
could improve the user experience here a bit.

Currently, we have two schema evolution "modes" available:
1. the "old" out-of-the-box schema evolution rules,
2. the "new" schema-on-read evolution rules.

Out-of-the-box schema evolution allows us to add nullable columns at the end
of a struct (root or nested); we cannot modify column order in the incoming
batch, and Hudi will not resolve it for the user. It also allows some limited
data type evolution.

On the other hand, the schema-on-read evolution rules allow adding,
reordering and dropping columns. They also allow for pretty flexible data
type evolution.

From my experience, and from some discussions in the Hudi Slack, a few use
cases for schema evolution emerge. I want to focus on the "new" schema-on-read
context.

1. Automatic/dynamic schema evolution - in this mode the user can provide a
partial record schema, and the table schema will be automatically evolved so
the user can "just" write to Hudi (while honoring schema evolution rules).
This could be supported for DataFrame writes (upsert, insert) and the MERGE
INTO SQL statement, for both INSERT * / UPDATE * and the partial inserts and
updates used in MERGE INTO. No columns should be dropped from the table
schema, reordering of columns should be taken care of, etc.

2. Enforce schema on write - in some use cases users do not want to evolve
the table schema automatically. In this case, the target table schema should
be enforced on write.

Currently, it feels like Hudi is not fully consistent in this matter:
- MERGE INTO enforces schema on write (target table schema) and drops
additional columns if needed,
- for UPSERT/INSERT, when schema on read and reconcile schema are enabled, it
does automatic schema evolution (missing columns are added, and the schema is
resolved to be compatible with the target table schema),
- for UPSERT/INSERT, when out-of-the-box schema evolution is used and
reconcile schema is enabled, a wider schema is accepted and the target table
schema is evolved accordingly, or, if the incoming schema is narrower, the
latest table schema is used. There are issues, though, when a new column is
added while another column is missing, or if the column order is mixed up in
the incoming batch.
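
For reference, the UPSERT/INSERT path I mean in the second bullet looks
roughly like this (a sketch; the table, columns and path are illustrative,
and the two switches are the schema-on-read and reconcile configs as I
understand them):

// Sketch (Scala, spark-shell): upsert with schema-on-read evolution and schema
// reconciliation enabled, i.e. missing columns in the incoming batch are
// tolerated and new columns evolve the table schema.
import spark.implicits._

val incoming = Seq((1, 1680200000L, "click")).toDF("event_id", "ts", "kind")

incoming.write.format("hudi").
  option("hoodie.table.name", "events").
  option("hoodie.datasource.write.recordkey.field", "event_id").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.operation", "upsert").
  option("hoodie.schema.on.read.enable", "true").
  option("hoodie.datasource.write.reconcile.schema", "true").
  mode("append").
  save("/tmp/hudi/events")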

From the user perspective, it would be good to focus on the new schema-on-read
evolution rules and introduce a new config:
hoodie.schema.evolution.strategy: merge [1] or enforce [2]

That being said, it could be a good idea to keep the reconcile.schema config
just for out-of-the-box schema evolution scenarios, to preserve existing
behavior.


Best Regards,
Daniel Kaźmirski


Re: Re: [DISCUSS] Hudi data TTL

2023-03-31 Thread Sivabalan
left some comments. thanks!

On Fri, 31 Mar 2023 at 00:59, 符其军 <18889897...@163.com> wrote:

> Hi community, we have submitted RFC-65 Partition TTL Management in this
> pr: https://github.com/apache/hudi/pull/8062.Let me know if you
> have any questions or concerns with this proposal.
> At 2022-10-21 14:42:10, "stream2000" <18889897...@163.com> wrote:
> >Yes we can have a talk about it. We will try our best to write the RFC,
> maybe publish it in a few weeks.
> >
> >
> >> On Oct 21, 2022, at 10:18, JerryYue <272614...@qq.com.INVALID> wrote:
> >>
> >> Looking forward to the RFC
> >> It's a good idea, we also need hudi data TTL in some case
> >> Do we have any plan or time to do this? We also had some simple designs
> to implement it
> >> Maybe we can had a talk about it
> >>
> >> On 2022/10/20 at 9:47 AM, "Bingeng Huang" qq@hudi.apache.org on behalf of hbgstc...@gmail.com> wrote:
> >>
> >>Looking forward to the RFC.
> >>We can propose RFC about support TTL config using non-partition
> field after
> >>
> >>
> >>
> >>sagar sumit wrote on Wed, Oct 19, 2022 at 14:42:
> >>
> >>> +1 Very nice idea. Looking forward to the RFC!
> >>>
> >>> On Wed, Oct 19, 2022 at 10:13 AM Shiyan Xu <
> xu.shiyan.raym...@gmail.com>
> >>> wrote:
> >>>
>  great proposal. Partition TTL is a good starting point. we can extend
> it
> >>> to
>  other TTL strategies like column-based, and make it customizable and
>  pluggable. Looking forward to the RFC!
> 
>  On Wed, Oct 19, 2022 at 11:40 AM Jian Feng
>  
>  wrote:
> 
> > Good idea,
> > this is definitely worth an  RFC
> > btw should it only depend on Hudi's partition? I feel it should be a
> >>> more
> > common feature since sometimes customers' data can not update across
> > partitions
> >
> >
> > On Wed, Oct 19, 2022 at 11:07 AM stream2000 <18889897...@163.com>
> >>> wrote:
> >
> >> Hi all, we have implemented a partition based data ttl management,
>  which
> >> we can manage ttl for hudi partition by size, expired time and
> >> sub-partition count. When a partition is detected as outdated, we
> use
> >> delete partition interface to delete it, which will generate a
> >>> replace
> >> commit to mark the data as deleted. The real deletion will then done
> >>> by
> >> clean service.
> >>
> >>
> >> If community is interested in this idea, maybe we can propose a RFC
> >>> to
> >> discuss it in detail.
> >>
> >>
> >>> On Oct 19, 2022, at 10:06, Vinoth Chandar 
> >>> wrote:
> >>>
> >>> +1 love to discuss this on a RFC proposal.
> >>>
> >>> On Tue, Oct 18, 2022 at 13:11 Alexey Kudinkin 
> >> wrote:
> >>>
>  That's a very interesting idea.
> 
>  Do you want to take a stab at writing a full proposal (in the form
>  of
> >> RFC)
>  for it?
> 
>  On Tue, Oct 18, 2022 at 10:20 AM Bingeng Huang <
> >>> hbgstc...@gmail.com
> >
>  wrote:
> 
> > Hi all,
> >
> > Do we have plan to integrate data TTL into HUDI, so we don't have
>  to
> > schedule a offline spark job to delete outdated data, just set a
>  TTL
> > config, then writer or some offline service will delete old data
> >>> as
> > expected.
> >
> 
> >>
> >>
> >
> > --
> > *Jian Feng,冯健*
> > Shopee | Engineer | Data Infrastructure
> >
> 
> 
>  --
>  Best,
>  Shiyan
> 
> >>>
> >>
>


-- 
Regards,
-Sivabalan


[VOTE] Release 0.12.3, release candidate #1

2023-03-31 Thread Sivabalan
Hi everyone,

Please review and vote on the release candidate #1 for the version 0.12.3,
as follows:

[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)

The complete staging area is available for your review, which includes:

* JIRA release notes [1],
* the official Apache source release and binary convenience releases to be
deployed to dist.apache.org [2], which are signed with the key with
fingerprint ACD52A06633DB3B2C7D0EA5642CA2D3ED5895122 [3],
* all artifacts to be deployed to the Maven Central Repository [4],
* source code tag "release-0.12.3-rc1" [5],

The vote will be open for at least 72 hours. It is adopted by majority
approval, with at least 3 PMC affirmative votes.

Thanks,
Release Manager


[1]
https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12352934&styleName=Html&projectId=12322822
[2] https://dist.apache.org/repos/dist/dev/hudi/hudi-0.12.3-rc1/
[3] https://dist.apache.org/repos/dist/dev/hudi/KEYS
[4] https://repository.apache.org/content/repositories/orgapachehudi-1119
[5] https://github.com/apache/hudi/releases/tag/release-0.12.3-rc1

-- 
Regards,
-Sivabalan


Re: When using the HoodieDeltaStreamer, is there a corresponding parameter that can control the number of cycles? For example, if I cycle 5 times, I stop accessing data

2023-03-31 Thread Sivabalan
We do have a graceful termination possibility with deltastreamer continuous
mode. Please check here for the post-write termination strategy. You can
implement your own termination strategy. Hope that helps.
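
For the original question (stop after N cycles), here is a rough sketch of
the idea, assuming the post-write hook boils down to a shouldShutdown-style
callback invoked after each ingest round. Please verify the actual
PostWriteTerminationStrategy interface and the termination-strategy config
key in your Hudi version; the names below are assumptions:

// Sketch only (Scala): stop the continuous-mode loop after a fixed number of
// rounds. The method name and the wiring are assumptions -- check the
// PostWriteTerminationStrategy class shipped with your Hudi version.
import java.util.concurrent.atomic.AtomicInteger

class FixedRoundsTerminationStrategy(maxRounds: Int) {
  private val rounds = new AtomicInteger(0)

  // Assumed hook, called after every source-fetch -> transform -> write round.
  def shouldShutdown(): Boolean = rounds.incrementAndGet() >= maxRounds
}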

On Thu, 30 Mar 2023 at 20:16, Vinoth Chandar  wrote:

> I believe there is no control today. You could hack a precommit validator
> and call System.exit if you want ;) (ugly, I know)
>
> But maybe we could introduce some abstraction to do a check between loops?
> or allow users to plugin some logic to decide whether to continue or exit?
>
> Love to understand the use-case more here.
>
> On Wed, Mar 29, 2023 at 7:32 AM lee  wrote:
>
> > When I use the HoodieDeltaStreamer, the "--continuous" parameter says: "Delta
> > Streamer runs in continuous mode running source-fetch -> Transform -> Hudi
> > Write in loop". So I would like to ask if there are any corresponding
> > parameters that can control the number of cycles, such as stopping
> > accessing data when I cycle 5 times.
> >
> >
> >
> > 李杰
> > leedd1...@163.com
> >
> >
>


-- 
Regards,
-Sivabalan


Re:Re: Re: Re: DISCUSS

2023-03-31 Thread 吕虎
Hi Vinoth, I'm glad to receive your reply. Here are some of my thoughts.
At 2023-03-31 10:17:52, "Vinoth Chandar"  wrote:
>I think we can focus more on validating the hash index + bloom filter vs
>consistent hash index more first. Have you looked at RFC-08, which is a
>kind of hash index as well, except it stores the key => file group mapping
>externally.

The idea of the RFC-08 index (rowKey -> partitionPath, fileId) is very
similar to the HBase index, but the index is implemented internally in Hudi,
so there is no need to worry about consistency issues. The index can be
written to HFiles quickly, but on read it is necessary to read from multiple
HFiles, so the performance of reading the index can be a problem. Therefore,
the RFC proposer naturally thought of using hash buckets to partially solve
this problem. HBase's solution to having multiple HFiles is to add a min/max
index and a Bloom filter index. In Hudi, you can directly create a min/max
index and a Bloom filter index per file group, eliminating the need to store
the index in HFiles; another solution is to compact the HFiles, but that also
adds a burden to Hudi. We need to consider the performance of reading HFiles
carefully when using RFC-08.

Therefore, I believe that hash partition + bloom filter is still the simplest
and most effective solution for predictable data growth within a small range.
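
For what it's worth, the bucketing I have in mind is nothing more than
deriving a stable hash partition from the record key, roughly like this (a
sketch, not Hudi code):

// Sketch: with numBuckets fixed up front, records with the same key always
// land in the same hash partition, so a lookup only has to consult that
// partition's file groups (each guarded by a Bloom filter and a min/max key
// range).
def hashPartition(recordKey: String, numBuckets: Int): Int =
  Math.floorMod(recordKey.hashCode, numBuckets)

// e.g. hashPartition("order-12345", 256) is the same on every write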

At 2023-03-31 10:17:52, "Vinoth Chandar"  wrote:
>I think we can focus more on validating the hash index + bloom filter vs
>consistent hash index more first. Have you looked at RFC-08, which is a
>kind of hash index as well, except it stores the key => file group mapping
>externally.
>
>On Fri, Mar 24, 2023 at 2:14 AM 吕虎  wrote:
>
>> Hi Vinoth, I am very happy to receive your reply. Here are some of my
>> thoughts。
>>
>> At 2023-03-21 23:32:44, "Vinoth Chandar"  wrote:
>> >>but when it is used for data expansion, it still involves the need to
>> >redistribute the data records of some data files, thus affecting the
>> >performance.
>> >but expansion of the consistent hash index is an optional operation right?
>>
>> >Sorry, not still fully understanding the differences here,
>> I'm sorry I didn't make myself clearly. The expansion I mentioned last
>> time refers to data records increase in hudi table.
>> The difference between consistent hash index and hash partition with Bloom
>> filters index is how to deal with  data increase:
>> For consistent hash index, the way of splitting the file is used.
>> Splitting files affects performance, but can permanently work effectively.
>> So consistent hash index is  suitable for scenarios where data increase
>> cannot be estimated or  data will increase large.
>> For hash partitions with Bloom filters index, the way of creating  new
>> files is used. Adding new files does not affect performance, but if there
>> are too many files, the probability of false positives in the Bloom filters
>> will increase. So hash partitions with Bloom filters index is  suitable for
>> scenario where data increase can be estimated over a relatively small range.
>>
>>
>> >>Because the hash partition field values under the parquet file in a
>> >columnar storage format are all equal, the added column field hardly
>> >occupies storage space after compression.
>> >Any new meta field added adds other overhead in terms evolving the schema,
>> >so forth. are you suggesting this is not possible to do without a new meta
>> >field?
>>
>> No new meta field  implementation is a more elegant implementation, but
>> for me, who is not yet familiar with the Hudi source code, it is somewhat
>> difficult to implement, but it is not a problem for experts. If you want to
>> implement it without adding new meta fields, I hope I can participate in
>> some simple development, and I can also learn how experts can do it.
>>
>>
>> >On Thu, Mar 16, 2023 at 2:22 AM 吕虎  wrote:
>> >
>> >> Hello,
>> >>  I feel very honored that you are interested in my views.
>> >>
>> >>  Here are some of my thoughts marked with blue font.
>> >>
>> >> At 2023-03-16 13:18:08, "Vinoth Chandar"  wrote:
>> >>
>> >> >Thanks for the proposal! Some first set of questions here.
>> >> >
>> >> >>You need to pre-select the number of buckets and use the hash
>> function to
>> >> >determine which bucket a record belongs to.
>> >> >>when building the table according to the estimated amount of data,
>> and it
>> >> >cannot be changed after building the table
>> >> >>When the amount of data in a hash partition is too large, the data in
>> >> that
>> >> >partition will be split into multiple files in the way of Bloom index.
>> >> >
>> >> >All these issues are related to bucket sizing could be alleviated by
>> the
>> >> >consistent hashing index in 0.13? Have you checked it out? Love to hear
>> >> >your thoughts on this.
>> >>
>> >> Hash partitioning is applicable to data tables that cannot give the
>> exact
>> >> capacity of data, but can estimate 

Re: [DISCUSS] Hudi Reverse Streamer

2023-03-31 Thread Davidiam
Hello Vinoth,

Can you please unsubscribe me?  I have been trying to unsubscribe for months 
without success.

Kind Regards,
David

Sent from Outlook for Android

From: Vinoth Chandar 
Sent: Friday, March 31, 2023 5:09:52 AM
To: dev 
Subject: [DISCUSS] Hudi Reverse Streamer

Hi all,

Any interest in building a reverse streaming tool that does the reverse of
what the DeltaStreamer tool does? It will read a Hudi table incrementally
(only as a source) and write out the data to a variety of sinks - Kafka, JDBC
databases, DFS.

This has come up many times with data warehouse users. Oftentimes, they want
to use Hudi to speed up or reduce costs on their data ingestion and ETL
(using Spark/Flink), but they want to move the derived data back into a data
warehouse or an operational database for serving.
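
To make it concrete, one round of such a reverse streamer could look roughly
like this with plain Spark today (a sketch; the path, Kafka broker, topic and
the checkpoint handling are illustrative, and the spark-sql-kafka package is
assumed to be on the classpath):

// Sketch (Scala, spark-shell): incrementally read a Hudi table since the last
// consumed instant and push the changed rows to Kafka. A real tool would
// persist lastInstant between runs and support JDBC/DFS sinks as well.
val lastInstant = "20230330000000000"  // illustrative checkpoint

val changes = spark.read.format("hudi").
  option("hoodie.datasource.query.type", "incremental").
  option("hoodie.datasource.read.begin.instanttime", lastInstant).
  load("/warehouse/hudi/derived_table")

changes.selectExpr("CAST(_hoodie_record_key AS STRING) AS key",
    "to_json(struct(*)) AS value").
  write.format("kafka").
  option("kafka.bootstrap.servers", "broker:9092").
  option("topic", "derived_table_changes").
  save()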

What do you all think?

Thanks
Vinoth


Re:Re: [DISCUSS] Hudi data TTL

2023-03-31 Thread 符其军
Hi community, we have submitted RFC-65 Partition TTL Management in this PR:
https://github.com/apache/hudi/pull/8062. Let me know if you have any
questions or concerns with this proposal.
At 2022-10-21 14:42:10, "stream2000" <18889897...@163.com> wrote:
>Yes we can have a talk about it. We will try our best to write the RFC, maybe 
>publish it in a few weeks.
>
>
>> On Oct 21, 2022, at 10:18, JerryYue <272614...@qq.com.INVALID> wrote:
>> 
>> Looking forward to the RFC
>> It's a good idea, we also need hudi data TTL in some case
>> Do we have any plan or time to do this? We also had some simple designs to 
>> implement it
>> Maybe we can had a talk about it
>> 
>> On 2022/10/20 at 9:47 AM, "Bingeng Huang" > hbgstc...@gmail.com> wrote:
>> 
>>Looking forward to the RFC.
>>We can propose RFC about support TTL config using non-partition field 
>> after
>> 
>> 
>> 
>>sagar sumit wrote on Wed, Oct 19, 2022 at 14:42:
>> 
>>> +1 Very nice idea. Looking forward to the RFC!
>>> 
>>> On Wed, Oct 19, 2022 at 10:13 AM Shiyan Xu 
>>> wrote:
>>> 
 great proposal. Partition TTL is a good starting point. we can extend it
>>> to
 other TTL strategies like column-based, and make it customizable and
 pluggable. Looking forward to the RFC!
 
 On Wed, Oct 19, 2022 at 11:40 AM Jian Feng >>> 
 wrote:
 
> Good idea,
> this is definitely worth an  RFC
> btw should it only depend on Hudi's partition? I feel it should be a
>>> more
> common feature since sometimes customers' data can not update across
> partitions
> 
> 
> On Wed, Oct 19, 2022 at 11:07 AM stream2000 <18889897...@163.com>
>>> wrote:
> 
>> Hi all, we have implemented a partition based data ttl management,
 which
>> we can manage ttl for hudi partition by size, expired time and
>> sub-partition count. When a partition is detected as outdated, we use
>> delete partition interface to delete it, which will generate a
>>> replace
>> commit to mark the data as deleted. The real deletion will then done
>>> by
>> clean service.
>> 
>> 
>> If community is interested in this idea, maybe we can propose a RFC
>>> to
>> discuss it in detail.
>> 
>> 
>>> On Oct 19, 2022, at 10:06, Vinoth Chandar 
>>> wrote:
>>> 
>>> +1 love to discuss this on a RFC proposal.
>>> 
>>> On Tue, Oct 18, 2022 at 13:11 Alexey Kudinkin 
>> wrote:
>>> 
 That's a very interesting idea.
 
 Do you want to take a stab at writing a full proposal (in the form
 of
>> RFC)
 for it?
 
 On Tue, Oct 18, 2022 at 10:20 AM Bingeng Huang <
>>> hbgstc...@gmail.com
> 
 wrote:
 
> Hi all,
> 
> Do we have plan to integrate data TTL into HUDI, so we don't have
 to
> schedule a offline spark job to delete outdated data, just set a
 TTL
> config, then writer or some offline service will delete old data
>>> as
> expected.
> 
 
>> 
>> 
> 
> --
> *Jian Feng,冯健*
> Shopee | Engineer | Data Infrastructure
> 
 
 
 --
 Best,
 Shiyan
 
>>> 
>> 


Re: [DISCUSS] Hudi Reverse Streamer

2023-03-31 Thread Pratyaksh Sharma
+1 to this.

I can help drive some of this work.

On Fri, Mar 31, 2023 at 10:09 AM Prashant Wason 
wrote:

> Could be useful. Also, may be useful for backup / replication scenario
> (keeping a copy of data in alternate/cloud DC).
>
> HoodieDeltaStreamer already has the concept of "sources". This can be
> implemented as a "sink" concept.
>
> On Thu, Mar 30, 2023 at 8:12 PM Vinoth Chandar  wrote:
>
> > Essentially.
> >
> > Old architecture: (operational database) ==> some tool ==> (data
> > warehouse raw data) ==> SQL ETL ==> (data warehouse derived data)
> >
> > New architecture: (operational database) ==> Hudi DeltaStreamer ==> (Hudi
> > raw data) ==> Spark/Flink Hudi ETL ==> (Hudi derived data) ==> Hudi
> > Reverse Streamer ==> (Data Warehouse/Kafka/Operational Database)
> >
> > On Thu, Mar 30, 2023 at 8:09 PM Vinoth Chandar 
> wrote:
> >
> > > Hi all,
> > >
> > > Any interest in building a reverse streaming tool, that does the
> reverse
> > > of what the DeltaStreamer tool does? It will read Hudi table
> > incrementally
> > > (only source) and write out the data to a variety of sinks - Kafka,
> JDBC
> > > Databases, DFS.
> > >
> > > This has come up many times with data warehouse users. Often times,
> they
> > > want to use Hudi to speed up or reduce costs on their data ingestion
> and
> > > ETL (using Spark/Flink), but want to move the derived data back into a
> > data
> > > warehouse or an operational database for serving.
> > >
> > > What do you all think?
> > >
> > > Thanks
> > > Vinoth
> > >
> >
>