Hi All,
We are currently trying to pull data incrementally from our RDBMS
sources. The way we do this with Hudi today is to create a Spark table on
top of the JDBC source using [1] and write the raw data from it to an HDFS
directory. We then use the DeltaStreamer DFS source to write that data to
a Hudi COPY_ON_WRITE table with upserts.
I think it would be really helpful for such use cases if DeltaStreamer
had something like a JDBC source, so that we would not need Sqoop or
temporary tables. We could then leave it running in continuous mode,
configured with a timestamp column and an interval that expresses how
frequently DeltaStreamer should check the RDBMS for new inserts or updates.
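
To make the timestamp-column idea concrete, here is a minimal sketch in Python, not an existing DeltaStreamer feature. It assumes the source table has a last-modified column (the name `updated_at` is hypothetical) and relies on the fact that Spark's JDBC `dbtable` option accepts a parenthesized subquery with an alias. The helper names `incremental_dbtable` and `next_checkpoint` are made up for illustration.

```python
from datetime import datetime


def incremental_dbtable(table: str, ts_column: str, checkpoint: datetime) -> str:
    """Build a subquery string for Spark's JDBC `dbtable` option.

    Only rows whose timestamp column is newer than the checkpoint are
    selected, so the filter runs on the database side.
    """
    ts = checkpoint.strftime("%Y-%m-%d %H:%M:%S")
    return f"(SELECT * FROM {table} WHERE {ts_column} > '{ts}') AS incr"


def next_checkpoint(rows, ts_column: str, checkpoint: datetime) -> datetime:
    """Return the newest timestamp among the fetched rows.

    If no rows were fetched, the previous checkpoint is kept.
    """
    return max((row[ts_column] for row in rows), default=checkpoint)
```

A continuous-mode job would, on each interval, build the subquery from the stored checkpoint, read it over JDBC, upsert the result, and then persist the value returned by `next_checkpoint`.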

[1]
CREATE TABLE mysql_temp_table
USING org.apache.spark.sql.jdbc
OPTIONS (
  url "jdbc:mysql://data.source.mysql.com:3306/database?user=mysql_user&password=password&zeroDateTimeBehavior=CONVERT_TO_NULL",
  dbtable "database.table_name",
  fetchSize "1000000",
  partitionColumn "contact_id",
  lowerBound "1",
  upperBound "2962429",
  numPartitions "62"
);

Regards,
Taher Koitawala