Re: [DISCUSS] Release Manager for 1.0

2023-08-16 Thread Vinoth Chandar
Awesome! That was easy. Let's go!

On Wed, Aug 16, 2023 at 5:32 AM sagar sumit  wrote:

> Hi Vinoth,
>
> 1.0 seems to be packed with exciting features.
> I would be glad to volunteer as the release manager.
>
> Regards,
> Sagar
>
> On Wed, Aug 16, 2023 at 5:24 PM Vinoth Chandar  wrote:
>
> > Hi PMC/Committers,
> >
> > We are looking for a volunteer to act as release manager for the 1.0
> > release.
> > https://cwiki.apache.org/confluence/display/HUDI/1.0+Execution+Planning
> >
> > Anyone interested?
> >
> > Thanks
> > Vinoth
> >
>


Re: [DISCUSS] Release Manager for 1.0

2023-08-16 Thread sagar sumit
Hi Vinoth,

1.0 seems to be packed with exciting features.
I would be glad to volunteer as the release manager.

Regards,
Sagar

On Wed, Aug 16, 2023 at 5:24 PM Vinoth Chandar  wrote:

> Hi PMC/Committers,
>
> We are looking for a volunteer to act as release manager for the 1.0
> release.
> https://cwiki.apache.org/confluence/display/HUDI/1.0+Execution+Planning
>
> Anyone interested?
>
> Thanks
> Vinoth
>


[DISCUSS] Release Manager for 1.0

2023-08-16 Thread Vinoth Chandar
Hi PMC/Committers,

We are looking for a volunteer to act as release manager for the 1.0
release.
https://cwiki.apache.org/confluence/display/HUDI/1.0+Execution+Planning

Anyone interested?

Thanks
Vinoth


Re: DISCUSS Hudi 1.x plans

2023-08-16 Thread Vinoth Chandar
Hello everyone,

We have been doing a lot of foundational design and prototyping work, and
I have outlined an execution plan here:
https://cwiki.apache.org/confluence/display/HUDI/1.0+Execution+Planning

Look forward to contributions!


On Wed, May 10, 2023 at 4:14 PM Sivabalan  wrote:

> Great! Left some feedback.
>
> On Wed, 10 May 2023 at 06:56, Vinoth Chandar  wrote:
> >
> > All - the RFC is up here. Please comment on the PR or use the dev list to
> > discuss ideas.
> > https://github.com/apache/hudi/pull/8679/
> >
> > On Mon, May 8, 2023 at 11:43 PM Vinoth Chandar 
> wrote:
> >
> > > I have claimed RFC-69, per our process.
> > >
> > > On Mon, May 8, 2023 at 9:19 PM Vinoth Chandar 
> wrote:
> > >
> > >> Hi all,
> > >>
> > >> I have been consolidating all our progress on Hudi and putting
> > >> together a proposal for the Hudi 1.x vision and a concrete plan for
> > >> the first version, 1.0.
> > >>
> > >> Will plan to open up the RFC to gather ideas across the community in
> > >> coming days.
> > >>
> > >> Thanks
> > >> Vinoth
> > >>
> > >
>
>
>
> --
> Regards,
> -Sivabalan
>


Re: [Feature Request] Support "faking" hudi commit time with the value of some field in the record

2023-08-16 Thread Vinoth Chandar
(sorry for the late reply)

Hi - the commit time can be a logical time as well; a lot of tests work
this way. There may be some table features (e.g. time-based cleaning) that
may not work, but those are more convenience ones anyway.

I assume the consumer would process all events at the required source
timestamp boundary to achieve this?

I am happy to chat/help scope the changes more.



On Wed, Aug 2, 2023 at 1:17 PM Joseph Thaidigsman
 wrote:

> Hello,
>
> We have a use-case where we have persisted the full CDC changelog for some
> tables in s3 and want to be able to bootstrap hudi tables with the
> changelog data and then be able to time-travel the hudi table to get
> snapshot views of the table on dates prior to bootstrapping. In our
> changelog, we have the timestamp associated with the
> inserts/updates/deletes, so the data to achieve this is present. If we had
> a live consumer processing those events in real-time and writing them to a
> hudi table, then we would be able to achieve this, but because we are
> instead creating the hudi table from a single batch job, we are unable to
> achieve it despite processing the same exact data, since time-travel is all
> based on the hudi commit time.
>
> Aside from our specific use-case for bootstrapping tables, this would be
> useful for real-time CDC consumers as well.  Currently, there is no way to
> guarantee the accuracy of the time-travel operation as it relates to
> reflecting the state of the upstream database table at a given point in
> time. For example, say you have some downstream batch pipelines that want
> to perform some aggregations based on production database tables at a fixed
> point each day. In the case of lag or outage on the consumer-side, when the
> consumer restarts, we have a large gap in hudi commit time and are unable
> to time-travel to the exact moment that the downstream pipelines expect to
> reflect the database table state.
>
> If the hudi writer instead supported picking some field from the CDC record
> as the value for the hudi commit time, then the consumer could process the
> events at any time and the time-travel functionality would be the same
> regardless of consumption time. This would make the writer idempotent in a
> way that it currently lacks, guaranteeing consistent results for downstream
> pipelines.
>
> Original Slack Thread:
> https://apache-hudi.slack.com/archives/C4D716NPQ/p1690583690053259
>
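To make the request concrete: time travel today resolves against the Hudi commit instant, not any timestamp inside the records. Below is a minimal pyspark sketch of the current read path, plus the rough shape the requested writer knob could take; the "commit.time.field" option is hypothetical and does not exist in Hudi today, and the table path is a placeholder.

```
# Time travel today: "as.of.instant" matches the instant Hudi assigned at
# write time, so a bulk backfill cannot reproduce pre-bootstrap snapshots.
snapshot = (
    spark.read.format("hudi")
    .option("as.of.instant", "20230801000000")  # a commit instant, not event time
    .load("/tmp/my_table")  # placeholder path
)

# Hypothetical shape of the requested feature (NOT an existing config):
# derive the commit instant from a field in the CDC record, so a replayed
# changelog would time-travel the same as live ingestion would have.
hudi_options = {
    "hoodie.table.name": "my_table",
    "hoodie.datasource.write.commit.time.field": "cdc_event_ts",  # hypothetical
}
```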


Re: [DISCUSS] Hudi Reverse Streamer

2023-08-16 Thread Vinoth Chandar
Hi Pratyaksh,

Are you still actively driving this?

On Tue, Jul 11, 2023 at 2:18 PM Pratyaksh Sharma 
wrote:

> Update: I will be raising the initial draft of RFC in the next couple of
> days.
>
> On Thu, Jun 15, 2023 at 2:28 AM Rajesh Mahindra 
> wrote:
>
> > Great. We also need it for use cases of loading data into warehouses, and
> > would love to help.
> >
> > On Wed, Jun 14, 2023 at 9:06 AM Pratyaksh Sharma 
> > wrote:
> >
> > > Hi,
> > >
> > > I missed this email earlier. Sure, let me start an RFC this week and
> > > we can take it from there.
> > >
> > > On Wed, Jun 14, 2023 at 9:20 PM Nicolas Paris <
> nicolas.pa...@riseup.net>
> > > wrote:
> > >
> > > > Hi, any RFC/ongoing efforts on the reverse delta streamer? We have
> > > > a use case to do hudi => Kafka and would enjoy building a more
> > > > general tool.
> > > >
> > > > However, we need an RFC basis to start the effort in the right way.
> > > >
> > > > On April 12, 2023 3:08:22 AM UTC, Vinoth Chandar <
> > > > mail.vinoth.chan...@gmail.com> wrote:
> > > > >Cool. Let's draw up an RFC for this? @pratyaksh - do you want to
> > > > >start one, given you expressed interest?
> > > > >
> > > > >On Mon, Apr 10, 2023 at 7:32 PM Léo Biscassi <
> leo.bisca...@gmail.com>
> > > > wrote:
> > > > >
> > > > >> +1
> > > > >> This would be great!
> > > > >>
> > > > >> Cheers,
> > > > >>
> > > > >> On Mon, Apr 3, 2023 at 3:00 PM Pratyaksh Sharma <
> > > pratyaks...@gmail.com>
> > > > >> wrote:
> > > > >>
> > > > >> > Hi Vinoth,
> > > > >> >
> > > > >> > I am aligned with the first reason that you mentioned. Better
> > > > >> > to have a separate tool to take care of this.
> > > > >> > On Mon, Apr 3, 2023 at 9:01 PM Vinoth Chandar <
> > > > >> > mail.vinoth.chan...@gmail.com>
> > > > >> > wrote:
> > > > >> >
> > > > >> > > +1
> > > > >> > >
> > > > >> > > I was thinking that we add a new utility and NOT extend
> > > > DeltaStreamer
> > > > >> by
> > > > >> > > adding a Sink interface, for the following reasons
> > > > >> > >
> > > > >> > > - It will make it look like a generic Source => Sink ETL
> > > > >> > > tool, which is actually not our intention to support on Hudi.
> > > > >> > > There are plenty of good tools for that out there.
> > > > >> > > - the config management can get a bit hard to understand,
> > > > >> > > since we overload ingest and reverse ETL into a single tool.
> > > > >> > > So break it off at the use-case level?
> > > > >> > >
> > > > >> > > Thoughts?
> > > > >> > >
> > > > >> > > David: PMC does not have control over that. Please see
> > > > >> > > unsubscribe instructions here:
> > > > >> > > https://hudi.apache.org/community/get-involved
> > > > >> > > Love to keep this thread about the reverse streamer
> > > > >> > > discussion. So kindly fork another thread if you want to
> > > > >> > > discuss unsubscribing.
> > > > >> > >
> > > > >> > > On Fri, Mar 31, 2023 at 1:47 AM Davidiam <
> > david.rosa...@gmail.com
> > > >
> > > > >> > wrote:
> > > > >> > >
> > > > >> > > > Hello Vinoth,
> > > > >> > > >
> > > > >> > > > Can you please unsubscribe me? I have been trying to
> > > > >> > > > unsubscribe for months without success.
> > > > >> > > >
> > > > >> > > > Kind Regards,
> > > > >> > > > David
> > > > >> > > >
> > > > >> > > > Sent from Outlook for Android
> > > > >> > > > 
> > > > >> > > > From: Vinoth Chandar 
> > > > >> > > > Sent: Friday, March 31, 2023 5:09:52 AM
> > > > >> > > > To: dev 
> > > > >> > > > Subject: [DISCUSS] Hudi Reverse Streamer
> > > > >> > > >
> > > > >> > > > Hi all,
> > > > >> > > >
> > > > >> > > > Any interest in building a reverse streaming tool, that
> > > > >> > > > does the reverse of what the DeltaStreamer tool does? It
> > > > >> > > > will read a Hudi table incrementally (only source) and write
> > > > >> > > > out the data to a variety of sinks - Kafka, JDBC databases,
> > > > >> > > > DFS.
> > > > >> > > >
> > > > >> > > > This has come up many times with data warehouse users.
> > > > >> > > > Oftentimes, they want to use Hudi to speed up or reduce
> > > > >> > > > costs on their data ingestion and ETL (using Spark/Flink),
> > > > >> > > > but want to move the derived data back into a data warehouse
> > > > >> > > > or an operational database for serving.
> > > > >> > > >
> > > > >> > > > What do you all think?
> > > > >> > > >
> > > > >> > > > Thanks
> > > > >> > > > Vinoth
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > > >>
> > > > >> --
> > > > >> *Léo Biscassi*
> > > > >> Blog - https://leobiscassi.com
> > > > >>
> > > > >>
> > > >
> > >
> >
> >
> > --
> > Take Care,
> > Rajesh Mahindra
> >
>
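For context on what such a tool would automate, here is a rough pyspark sketch of a single "reverse streaming" run under stated assumptions: an incremental read of the Hudi table since the last processed instant, pushed out through the standard Spark Kafka sink. Persisting `last_instant` between runs, the table path, the broker address, and the topic name are all placeholders, not part of any agreed design.

```
# Incremental read: only records committed after `last_instant`.
base_path = "/tmp/my_table"     # placeholder
last_instant = "20230801000000" # placeholder; a real tool would checkpoint this
incremental = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", last_instant)
    .load(base_path)
)

# Push each record to Kafka as JSON via the built-in Spark Kafka sink.
(
    incremental
    .selectExpr("CAST(_hoodie_record_key AS STRING) AS key",
                "to_json(struct(*)) AS value")
    .write.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
    .option("topic", "hudi-reverse-stream")            # placeholder
    .save()
)
```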


Re: Record level index with not unique keys

2023-08-16 Thread Vinoth Chandar
Hi,

Yes, the indexing DAG can support this today, and even if not, it can be
easily fixed. The main issue would be how we encode the mapping well:
e.g., if we want to map from user_id to all events that belong to the user,
we need a different, scalable way of storing this mapping.
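To sketch one possibility: the record index entry could hold an array of locations per key instead of a single location struct. A hedged pyspark illustration follows; the field names are borrowed from the existing RLI struct quoted later in this thread, and the array-of-locations shape is just an idea, not a committed design.

```
from pyspark.sql.types import (StructType, StructField, ArrayType,
                               StringType, LongType, IntegerType)

# Hypothetical one-to-many record index entry: one record key maps to a
# list of (partition, fileId bits, file index, instant time) locations.
record_index_entry = StructType([
    StructField("recordKey", StringType(), False),
    StructField("locations", ArrayType(StructType([
        StructField("partition", StringType(), False),
        StructField("fileIdHighBits", LongType(), False),
        StructField("fileIdLowBits", LongType(), False),
        StructField("fileIndex", IntegerType(), False),
        StructField("instantTime", LongType(), False),
    ])), False),
])
```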

I can organize this work under the 1.0 stream, if you are interested in
driving.

Thanks
Vinoth


On Thu, Jul 13, 2023 at 1:09 PM nicolas paris 
wrote:

> Hello Prashant, thanks for your time.
>
>
> > With non-unique keys how would tagging of records (for updates /
> > deletes) work?
>
> Currently both GLOBAL_SIMPLE/BLOOM work out of the box in the mentioned
> context. See the pyspark script and results below. As for the
> implementation, tagLocationBacktoRecords returns an RDD of
> HoodieRecord with (key/part/location), and it can contain duplicate
> keys (i.e., multiple records for the same key).
>
> ```
> tableName = "test_global_bloom"
> basePath = f"/tmp/{tableName}"
>
> hudi_options = {
> "hoodie.table.name": tableName,
> "hoodie.datasource.write.recordkey.field": "event_id",
> "hoodie.datasource.write.partitionpath.field": "part",
> "hoodie.datasource.write.table.name": tableName,
> "hoodie.datasource.write.precombine.field": "ts",
> "hoodie.datasource.write.hive_style_partitioning": "true",
> "hoodie.datasource.hive_sync.enable": "false",
> "hoodie.metadata.enable": "true",
> "hoodie.index.type": "GLOBAL_BLOOM", # GLOBAL_SIMPLE works as well
> }
>
> # LET'S GEN DUPLS
> mode="overwrite"
> df =spark.sql("""select '1' as event_id, '2' as ts, '2' as part UNION
>  select '1' as event_id, '3' as ts, '3' as part UNION
>  select '1' as event_id, '2' as ts, '3' as part UNION
>  select '2' as event_id, '2' as ts, '3' as part""")
> df.write.format("hudi").options(**hudi_options).option("hoodie.datasource.write.operation", "BULK_INSERT").mode(mode).save(basePath)
> spark.read.format("hudi").load(basePath).select("event_id",
> "ts","part").show()
> # +--------+---+----+
> # |event_id| ts|part|
> # +--------+---+----+
> # |       1|  3|   3|
> # |       1|  2|   3|
> # |       2|  2|   3|
> # |       1|  2|   2|
> # +--------+---+----+
>
> # UPDATE
> mode="append"
> spark.sql("select '1' as event_id, '20' as ts, '4' as
> part").write.format("hudi").options(**hudi_options).option("hoodie.data
> source.write.operation", "UPSERT").mode(mode).save(basePath)
> spark.read.format("hudi").load(basePath).select("event_id",
> "ts","part").show()
> # +--------+---+----+
> # |event_id| ts|part|
> # +--------+---+----+
> # |       1| 20|   4|
> # |       1| 20|   4|
> # |       1| 20|   4|
> # |       2|  2|   3|
> # +--------+---+----+
>
> # DELETE
> mode="append"
> spark.sql("select 1 as
> event_id").write.format("hudi").options(**hudi_options).option("hoodie.
> datasource.write.operation", "DELETE").mode(mode).save(basePath)
> spark.read.format("hudi").load(basePath).select("event_id",
> "ts","part").show()
> # +--------+---+----+
> # |event_id| ts|part|
> # +--------+---+----+
> # |       2|  2|   3|
> # +--------+---+----+
> ```
>
>
> > How would record Index know which mapping of the array to
> > return for a given record key?
>
> As with GLOBAL_SIMPLE/BLOOM, for a given record key, the RLI would
> return a list of mappings. Then the operation (update, delete, FCOW ...)
> would apply to each location.
>
> To illustrate, we could get something like this in the MDT:
>
> |event_id:1|[
>   {part=2, -5811947225812876253, -6812062179961430298, 0, 1689147210233},
>   {part=3, -711947225812876253, -8812062179961430298, 1, 1689147210233},
>   {part=3, -1811947225812876253, -2812062179961430298, 0, 1689147210233}
> ]|
>
>
> On Thu, 2023-07-13 at 10:17 -0700, Prashant Wason wrote:
> > Hi Nicolas,
> >
> > The RI feature is designed for max performance, as it operates at
> > record-count scale. Hence, the schema is simplified and minimized.
> >
> > With non-unique keys how would tagging of records (for updates /
> > deletes) work? How would record Index know which mapping of the array
> > to return for a given record key?
> >
> > Thanks
> > Prashant
> >
> >
> >
> > On Wed, Jul 12, 2023 at 2:02 AM nicolas paris
> > 
> > wrote:
> >
> > > hi there,
> > >
> > > Just tested preview of RLI (rfc-08), amazing feature. Soon the fast
> > > COW (rfc-68) will be based on RLI to get the parquet offsets and
> > > allow targeting parquet row groups.
> > >
> > > RLI is a global index, therefore it assumes the hudi key is present
> > > in at most one parquet file. As a result, in the MDT the RLI is of
> > > type struct, and there is a 1:1 mapping w/ a given file.
> > >
> > > Type:
> > >  |-- recordIndexMetadata: struct (nullable = true)
> > >  |    |-- partition: string (nullable = false)
> > >  |    |-- fileIdHighBits: long (nullable = false)
> > >  |    |-- fileIdLowBits: long (nullable = false)
> > >  |


Re: [DISCUSS] Should we support a service to manage all deltastreamer jobs?

2023-08-16 Thread Vinoth Chandar
+1. There are RFCs on table management services, but nothing specific to
deltastreamer itself.

Are you proposing building something specific to that?

On Wed, Jun 14, 2023 at 8:26 AM Pratyaksh Sharma 
wrote:

> Hi,
>
> Personally I am in favour of creating such a UI where monitoring and
> managing configurations is just a click away. That makes life a lot easier
> for users. So +1 on the proposal.
>
> I remember the work for it had started long back, around 2019. You can
> check this RFC
> <https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=130027233>
> for your reference. I am not sure why this work could not continue though.
>
> On Wed, Jun 14, 2023 at 4:28 PM 孔维 <18701146...@163.com> wrote:
>
> > Hi, team,
> >
> >
> > Background:
> > More and more Hudi pipelines use deltastreamer, resulting in a large
> > number of deltastreamer jobs that need to be managed. In our company, we
> > manage a large number of deltastreamer jobs ourselves, which involves a
> > lot of operations, maintenance, and monitoring work.
> > If we could provide a deltastreamer service to create, manage, and
> > monitor all jobs in a unified manner, it would greatly reduce the
> > management burden, lower the barrier to using deltastreamer, and help
> > drive its adoption.
> > Also, since deltastreamer already supports hot configuration updates [
> > https://github.com/apache/hudi/pull/8807], such a service could build on
> > that feature to apply configuration changes without restarting the job.
> >
> >
> > We hope to provide:
> > - A web UI to support creating, managing, and monitoring deltastreamer
> > jobs
> > - Timely configuration changes through the hot update capability
> >
> >
> > I am not sure whether such a service is in line with the community's
> > direction, and I look forward to your replies!
> >
> >
> > Best Regards
>