It should work like any other source; none of the other sources are aware
of whether DeltaStreamer is running in continuous mode or not.

Simplistically, it just needs a config to denote an incremental field - say
`_last_modified_at` - and we use that as a checkpoint to tail the table
by including a `where _last_modified_at > last_checkpoint` clause, using
the spark.read.jdbc("") datasource..

You can look at HoodieIncrSource for inspiration..
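The checkpoint-driven tailing described above could be sketched roughly as
follows. This is a minimal illustration only; the helper name, config shape,
and subquery alias are hypothetical, not Hudi's actual API:

```python
from typing import Optional

def build_incremental_query(table: str, incr_field: str,
                            last_checkpoint: Optional[str]) -> str:
    """Build the SQL pushed down via spark.read.jdbc's dbtable option.

    The first run (no checkpoint yet) does a full scan; subsequent runs
    tail the table from the last checkpointed value of the incremental
    field, e.g. _last_modified_at.
    """
    base = f"SELECT * FROM {table}"
    if last_checkpoint is not None:
        base += f" WHERE {incr_field} > '{last_checkpoint}'"
    # Spark's dbtable option accepts a parenthesized subquery with an alias.
    return f"({base}) AS hudi_incr_pull"
```

Each run would then persist the maximum `_last_modified_at` value it read as
the checkpoint for the next poll, similar in spirit to how HoodieIncrSource
checkpoints on commit times.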




On Mon, Sep 16, 2019 at 9:02 AM Taher Koitawala <[email protected]> wrote:

> Will this be the same implementation as session.read.jdbc(""), with the
> code called continuously, like how we run Hudi in continuous mode?
>
> On Mon, Sep 16, 2019 at 9:09 PM Vinoth Chandar <[email protected]> wrote:
>
> > Thanks, Taher! Any takers for driving this? This is something I would be
> > very interested in getting involved with. Don't have the bandwidth atm :/
> >
> > On Sun, Sep 15, 2019 at 11:15 PM Taher Koitawala <[email protected]>
> > wrote:
> >
> > > Thank you all for your support. JIRA filed at
> > > https://issues.apache.org/jira/browse/HUDI-251
> > >
> > > Regards,
> > > Taher Koitawala
> > >
> > > On Mon, Sep 16, 2019 at 11:34 AM Taher Koitawala <[email protected]>
> > > wrote:
> > >
> > > > Since everyone is fully onboard. I am creating a JIRA to track this.
> > > >
> > > > On Sun, Sep 15, 2019 at 9:47 AM [email protected] <[email protected]>
> > > > wrote:
> > > >
> > > >>
> > > >> +1. Agree with everyone's point. Go for it Taher !!
> > > >> Balaji.V
> > > >>
> > > >> On Saturday, September 14, 2019, 07:44:04 PM PDT, Bhavani Sudha
> > > >> Saktheeswaran <[email protected]> wrote:
> > > >>
> > > >> +1 I think adding new sources to DeltaStreamer is really valuable.
> > > >>
> > > >> Thanks,
> > > >> Sudha
> > > >>
> > > >> On Sat, Sep 14, 2019 at 7:52 AM vino yang <[email protected]>
> > > wrote:
> > > >>
> > > >> > Hi Taher,
> > > >> >
> > > >> > IMO, it's a good supplement to Hudi.
> > > >> >
> > > >> > So +1 from my side.
> > > >> >
> > > >> > Vinoth Chandar <[email protected]> wrote on Sat, Sep 14, 2019 at
> > > >> > 10:23 PM:
> > > >> >
> > > >> > > Hi Taher,
> > > >> > >
> > > >> > > I am fully onboard on this. This is such a frequently asked
> > > >> > > question, and having it all doable with a simple DeltaStreamer
> > > >> > > command would be really powerful.
> > > >> > >
> > > >> > > +1
> > > >> > >
> > > >> > > - Vinoth
> > > >> > >
> > > >> > > On 2019/09/14 05:51:05, Taher Koitawala <[email protected]> wrote:
> > > >> > > > Hi All,
> > > >> > > >          Currently, we are trying to pull data incrementally
> > > >> > > > from our RDBMS sources; however, the way we are doing this
> > > >> > > > with Hudi is to create a Spark table on top of the JDBC
> > > >> > > > source using [1], which writes raw data to an HDFS dir. We
> > > >> > > > then use the DeltaStreamer dfs-source to write that to a
> > > >> > > > Hudi upsert COPY_ON_WRITE table.
> > > >> > > >
> > > >> > > >          However, I think it would be really helpful in such
> > > >> > > > use cases if DeltaStreamer had something like a JDBC source
> > > >> > > > instead of Sqoop or temp tables; then we could leave it
> > > >> > > > running in continuous mode with a timestamp column and an
> > > >> > > > interval that lets us express how frequently DeltaStreamer
> > > >> > > > should check the RDBMS for new updates or inserts.
> > > >> > > >
> > > >> > > > 1: CREATE TABLE mysql_temp_table
> > > >> > > > USING org.apache.spark.sql.jdbc
> > > >> > > > OPTIONS (
> > > >> > > >      url "jdbc:mysql://data.source.mysql.com:3306/database?user=mysql_user&password=password&zeroDateTimeBehavior=CONVERT_TO_NULL",
> > > >> > > >      dbtable "database.table_name",
> > > >> > > >      fetchSize "1000000",
> > > >> > > >      partitionColumn "contact_id", lowerBound "1",
> > > >> > > > upperBound "2962429",
> > > >> > > > numPartitions "62"
> > > >> > > > );
> > > >> > > >
> > > >> > > > Regards,
> > > >> > > > Taher Koitawala
> > > >> > > >
> > > >> > >
> > > >> >
> > > >
> > > >
> > >
> >
>