Re: [DISCUSS] [VOTE] JDBC incremental load with DeltaStreamer

Taher Koitawala Sun, 15 Sep 2019 23:16:12 -0700

Thank you all for your support. JIRA filed at
https://issues.apache.org/jira/browse/HUDI-251


Regards,
Taher Koitawala

On Mon, Sep 16, 2019 at 11:34 AM Taher Koitawala <[email protected]> wrote:

> Since everyone is fully on board, I am creating a JIRA to track this.
>
> On Sun, Sep 15, 2019 at 9:47 AM [email protected] <[email protected]>
> wrote:
>
>>
>> +1. Agree with everyone's point. Go for it, Taher!
>> Balaji.V
>>
>> On Saturday, September 14, 2019, 07:44:04 PM PDT, Bhavani Sudha
>> Saktheeswaran <[email protected]> wrote:
>>
>> +1. I think adding new sources to DeltaStreamer is really valuable.
>>
>> Thanks,
>> Sudha
>>
>> On Sat, Sep 14, 2019 at 7:52 AM vino yang <[email protected]> wrote:
>>
>> > Hi Taher,
>> >
>> > IMO, it's a good supplement to Hudi.
>> >
>> > So +1 from my side.
>> >
>> > Vinoth Chandar <[email protected]> wrote on Sat, Sep 14, 2019 at 10:23 PM:
>> >
>> > > Hi Taher,
>> > >
>> > > I am fully on board with this. This is such a frequently asked
>> > > question, and having it all doable with a simple DeltaStreamer
>> > > command would be really powerful.
>> > >
>> > > +1
>> > >
>> > > - Vinoth
>> > >
>> > > On 2019/09/14 05:51:05, Taher Koitawala <[email protected]> wrote:
>> > > > Hi All,
>> > > >          Currently, we are trying to pull data incrementally from
>> > > > our RDBMS sources. The way we do this with HUDI is to create a
>> > > > Spark table on top of the JDBC source using [1], which writes raw
>> > > > data to an HDFS dir. We then use the DeltaStreamer dfs-source to
>> > > > write that to a HUDI upsert COPY_ON_WRITE table.
>> > > >
>> > > >          However, I think it would be really helpful in such use
>> > > > cases if DeltaStreamer had something like a JDBC-source instead of
>> > > > sqoop or temp tables. We could then leave it in continuous mode
>> > > > with a timestamp column and an interval that expresses how
>> > > > frequently DeltaStreamer should check for new updates or inserts
>> > > > on the RDBMS.
>> > > >
>> > > > 1: CREATE TABLE mysql_temp_table
>> > > > USING org.apache.spark.sql.jdbc
>> > > > OPTIONS (
>> > > >      url "jdbc:mysql://data.source.mysql.com:3306/database?user=mysql_user&password=password&zeroDateTimeBehavior=CONVERT_TO_NULL",
>> > > >      dbtable "database.table_name",
>> > > >      fetchSize "1000000",
>> > > >      partitionColumn "contact_id",
>> > > >      lowerBound "1",
>> > > >      upperBound "2962429",
>> > > >      numPartitions "62"
>> > > > );
>> > > >
>> > > > Regards,
>> > > > Taher Koitawala
>> > > >
>> > >
>> >
>
>
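
For illustration only, the incremental pull proposed in the original message
(a timestamp column plus a checkpoint that is advanced on every poll) might be
sketched as below. This is not the HUDI-251 design and not DeltaStreamer code.
An in-memory sqlite3 database stands in for the JDBC source, and the names
pull_incremental, contacts and updated_at are hypothetical, chosen just for
this example.

```python
import sqlite3


def pull_incremental(conn, table, ts_column, checkpoint):
    """Return (rows changed after `checkpoint`, new checkpoint).

    `table` and `ts_column` are hypothetical, trusted identifiers; SQL
    identifiers cannot be bound as parameters, so they are interpolated.
    """
    query = f"SELECT * FROM {table} WHERE {ts_column} > ? ORDER BY {ts_column}"
    cur = conn.execute(query, (checkpoint,))
    columns = [d[0] for d in cur.description]
    rows = [dict(zip(columns, r)) for r in cur.fetchall()]
    # Advance the checkpoint to the newest timestamp seen, if any.
    new_checkpoint = rows[-1][ts_column] if rows else checkpoint
    return rows, new_checkpoint


def demo():
    # sqlite3 is only a stand-in for an RDBMS reached over JDBC.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE contacts ("
        "contact_id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT)"
    )
    conn.executemany(
        "INSERT INTO contacts VALUES (?, ?, ?)",
        [
            (1, "a", "2019-09-14 05:00:00"),
            (2, "b", "2019-09-14 06:00:00"),
        ],
    )
    # First poll: an empty checkpoint sorts before every timestamp string.
    first, ckpt = pull_incremental(conn, "contacts", "updated_at", "")

    # One update and one insert happen on the source between polls.
    conn.execute(
        "UPDATE contacts SET name = 'a2', updated_at = '2019-09-15 01:00:00' "
        "WHERE contact_id = 1"
    )
    conn.execute("INSERT INTO contacts VALUES (3, 'c', '2019-09-15 02:00:00')")

    # Second poll: only rows newer than the stored checkpoint come back.
    second, ckpt2 = pull_incremental(conn, "contacts", "updated_at", ckpt)
    return first, ckpt, second, ckpt2
```

In a continuous mode, a caller would presumably invoke pull_incremental in a
loop, sleep for the configured interval between polls, and hand each batch to
the upsert step. Note that the strict ">" comparison can skip rows that share
the checkpoint's exact timestamp, so a real source would need a policy for
ties.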
