Hi All,

Currently, we are trying to pull data incrementally from our RDBMS sources. The way we do this with Hudi is to create a Spark table on top of the JDBC source using [1] and write the raw data to an HDFS directory. We then use the DeltaStreamer DFS source to upsert that data into a Hudi COPY_ON_WRITE table.
However, I think it would be really helpful for such use cases if DeltaStreamer had something like a JDBC source, so that we would not need Sqoop or temp tables. We could then leave it running in continuous mode, configured with a timestamp column and an interval that expresses how frequently DeltaStreamer should check the RDBMS for new updates or inserts.

[1]:
CREATE TABLE mysql_temp_table
USING org.apache.spark.sql.jdbc
OPTIONS (
  url "jdbc:mysql://data.source.mysql.com:3306/database?user=mysql_user&password=password&zeroDateTimeBehavior=CONVERT_TO_NULL",
  dbtable "database.table_name",
  fetchSize "1000000",
  partitionColumn "contact_id",
  lowerBound "1",
  upperBound "2962429",
  numPartitions "62"
);

Regards,
Taher Koitawala
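
To make the proposal above a little more concrete, here is a rough Python sketch of the incremental query such a JDBC source might issue on each poll. It is only an illustration, not an existing DeltaStreamer feature: the function names, the `updated_at` column, and the `incr` alias are hypothetical. It relies on Spark's JDBC `dbtable` option accepting a parenthesized subquery with an alias.

```python
from datetime import datetime
from typing import Optional


def incremental_dbtable(table: str, ts_column: str,
                        last_checkpoint: Optional[datetime]) -> str:
    """Build a value for Spark's JDBC `dbtable` option.

    With no checkpoint (first run) the whole table is read; afterwards only
    rows whose timestamp column is newer than the checkpoint are selected.
    `table` and `ts_column` are assumed to come from trusted configuration.
    """
    if last_checkpoint is None:
        return table
    ts = last_checkpoint.strftime("%Y-%m-%d %H:%M:%S")
    return f"(SELECT * FROM {table} WHERE {ts_column} > '{ts}') AS incr"


def read_increment(spark, url: str, table: str, ts_column: str,
                   last_checkpoint: Optional[datetime]):
    """Read one increment as a DataFrame (needs an active SparkSession)."""
    return (spark.read.format("jdbc")
            .option("url", url)
            .option("dbtable", incremental_dbtable(table, ts_column, last_checkpoint))
            .load())
```

In continuous mode, the source would call something like `read_increment` once per interval, hand the resulting rows to the Hudi upsert, and then store the largest timestamp it saw as the next checkpoint.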