[DISCUSS] [VOTE] JDBC incremental load with DeltaStreamer

Taher Koitawala Fri, 13 Sep 2019 22:51:44 -0700

Hi All,
Currently, we are trying to pull data incrementally from our RDBMS
sources. The way we do this with HUDI today is to create a Spark table
on top of the JDBC source using [1], which writes the raw data to an
HDFS dir. We then use the DeltaStreamer DFS source to upsert that data
into a HUDI COPY_ON_WRITE table.


However, I think it would be really helpful in such use cases if
DeltaStreamer had something like a JDBC source, instead of us relying on
Sqoop or temp tables. We could then leave it running in continuous mode,
configured with a timestamp column and an interval that expresses how
frequently DeltaStreamer should check the RDBMS for new updates or inserts.
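
To make the idea concrete, here is a rough sketch of the polling logic I
have in mind. It is not DeltaStreamer code and does not reflect any existing
HUDI API. It uses Python with an in-memory SQLite database standing in for
the RDBMS, and the table name `contacts`, the column `updated_at`, and the
function `pull_increment` are all made up for illustration. It also assumes
the timestamp column is updated on every insert and update.

```python
import sqlite3


def pull_increment(conn, table, ts_column, checkpoint):
    """Fetch rows whose timestamp is newer than `checkpoint`.

    Returns the rows (as dicts) and the checkpoint to use for the next pull.
    `table` and `ts_column` are assumed to come from trusted job config.
    """
    query = f"SELECT * FROM {table} WHERE {ts_column} > ? ORDER BY {ts_column}"
    cur = conn.execute(query, (checkpoint,))
    columns = [d[0] for d in cur.description]
    rows = [dict(zip(columns, r)) for r in cur.fetchall()]
    # Advance the checkpoint only if something new was read.
    new_checkpoint = rows[-1][ts_column] if rows else checkpoint
    return rows, new_checkpoint


def demo():
    # SQLite stands in for the real RDBMS (hypothetical schema).
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE contacts ("
        "contact_id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT)"
    )
    conn.executemany(
        "INSERT INTO contacts VALUES (?, ?, ?)",
        [(1, "a", "2019-09-01 00:00:00"), (2, "b", "2019-09-02 00:00:00")],
    )
    # First pull: everything is newer than the initial checkpoint.
    first, ckpt = pull_increment(conn, "contacts", "updated_at",
                                 "1970-01-01 00:00:00")
    # An update on the source bumps the timestamp of one row.
    conn.execute(
        "UPDATE contacts SET name = 'a2', updated_at = '2019-09-03 00:00:00' "
        "WHERE contact_id = 1"
    )
    # Second pull: only the changed row comes back.
    second, ckpt = pull_increment(conn, "contacts", "updated_at", ckpt)
    return first, second, ckpt
```

In continuous mode the source would call something like `pull_increment` in
a loop, sleep for the configured interval between calls, and hand each batch
to the upsert path. One caveat of the strict `>` comparison is that rows
committed later with a timestamp equal to the last checkpoint would be
missed, so a real implementation would need to decide how to handle that.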

[1] CREATE TABLE mysql_temp_table
USING org.apache.spark.sql.jdbc
OPTIONS (
     url "jdbc:mysql://data.source.mysql.com:3306/database?user=mysql_user&password=password&zeroDateTimeBehavior=CONVERT_TO_NULL",
     dbtable "database.table_name",
     fetchSize "1000000",
     partitionColumn "contact_id",
     lowerBound "1",
     upperBound "2962429",
     numPartitions "62"
);

Regards,
Taher Koitawala
