Hi All,
         Currently, we are trying to pull data incrementally from our RDBMS
sources. The way we do this with HUDI is to create a Spark table on top of
the JDBC source using [1] and write the raw data from it to an HDFS dir.
We then use the DeltaStreamer dfs-source to write that data to a HUDI
upsert COPY_ON_WRITE table.

          However, I think it would be really helpful in such use cases
if DeltaStreamer had something like a JDBC source, instead of relying on
sqoop or temp tables. We could then leave it running in continuous mode,
configured with a timestamp column and an interval that expresses how
frequently DeltaStreamer should check for new updates or inserts on the
RDBMS.
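
To make the idea concrete, here is a minimal sketch of the timestamp-based
incremental pull described above. It is not DeltaStreamer code and does not
use any HUDI API: it uses Python's built-in sqlite3 as a stand-in for the
JDBC source, and the names pull_increment, contacts and updated_at are
hypothetical, chosen only for illustration.

```python
import sqlite3


def pull_increment(conn, table, ts_column, checkpoint):
    """Fetch rows whose ts_column is newer than checkpoint.

    Returns (rows, new_checkpoint). table and ts_column are identifiers,
    which cannot be bound as SQL parameters, so they are assumed to come
    from trusted configuration rather than user input.
    """
    query = (
        f"SELECT * FROM {table} "
        f"WHERE {ts_column} > ? ORDER BY {ts_column}"
    )
    cur = conn.execute(query, (checkpoint,))
    cols = [d[0] for d in cur.description]
    rows = [dict(zip(cols, r)) for r in cur.fetchall()]
    # Advance the checkpoint only if something new was read.
    new_checkpoint = rows[-1][ts_column] if rows else checkpoint
    return rows, new_checkpoint


def demo():
    # In-memory table standing in for the RDBMS source (hypothetical schema).
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE contacts ("
        "contact_id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT)"
    )
    conn.executemany(
        "INSERT INTO contacts VALUES (?, ?, ?)",
        [
            (1, "a", "2019-09-01 10:00:00"),
            (2, "b", "2019-09-01 11:00:00"),
        ],
    )

    # First poll: everything is newer than the initial checkpoint.
    first, ckpt = pull_increment(
        conn, "contacts", "updated_at", "1970-01-01 00:00:00"
    )

    # Between polls, one row is updated and one row is inserted.
    conn.execute(
        "UPDATE contacts SET name = 'a2', "
        "updated_at = '2019-09-01 12:00:00' WHERE contact_id = 1"
    )
    conn.execute(
        "INSERT INTO contacts VALUES (3, 'c', '2019-09-01 12:30:00')"
    )

    # Second poll: only the changed and the new row come back.
    second, ckpt2 = pull_increment(conn, "contacts", "updated_at", ckpt)
    return first, ckpt, second, ckpt2
```

A continuous mode would call pull_increment in a loop, sleep for the
configured interval between polls, and hand each batch to the upsert. One
caveat with this sketch is that the strict ">" comparison can miss a row
that commits later with the same timestamp as the checkpoint, so a real
source would need to handle that boundary.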

1: CREATE TABLE mysql_temp_table
USING org.apache.spark.sql.jdbc
OPTIONS (
     url "jdbc:mysql://data.source.mysql.com:3306/database?user=mysql_user&password=password&zeroDateTimeBehavior=CONVERT_TO_NULL",
     dbtable "database.table_name",
     fetchSize "1000000",
     partitionColumn "contact_id",
     lowerBound "1",
     upperBound "2962429",
     numPartitions "62"
);

Regards,
Taher Koitawala
