Hi Taher, 

I am fully on board with this. This is such a frequently asked question, and
being able to do it all with a simple DeltaStreamer command would be really
powerful.

+1 

- Vinoth 

On 2019/09/14 05:51:05, Taher Koitawala <[email protected]> wrote: 
> Hi All,
>          Currently, we are trying to pull data incrementally from our RDBMS
> sources. The way we do this with HUDI is to create a Spark table on top of
> the JDBC source using [1], from which the raw data is written to an HDFS
> dir. We then use the DeltaStreamer dfs-source to upsert that data into a
> HUDI COPY_ON_WRITE table.
> 
>           However, I think it would be really helpful in such use cases if
> DeltaStreamer had something like a JDBC-source instead of relying on sqoop
> or temp tables. We could then leave it running in continuous mode,
> configured with a timestamp column and an interval that expresses how
> frequently DeltaStreamer should check the RDBMS for new updates or inserts.
> 
> 1: CREATE TABLE mysql_temp_table
> USING org.apache.spark.sql.jdbc
> OPTIONS (
>      url "jdbc:mysql://data.source.mysql.com:3306/database?user=mysql_user&password=password&zeroDateTimeBehavior=CONVERT_TO_NULL",
>      dbtable "database.table_name",
>      fetchSize "1000000",
>      partitionColumn "contact_id",
>      lowerBound "1",
>      upperBound "2962429",
>      numPartitions "62"
> );
> 
> Regards,
> Taher Koitawala
> 
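
To make the proposal above concrete, here is a minimal sketch of timestamp-based incremental polling. It is not DeltaStreamer or Hudi code: the function names (fetch_increment, poll, demo), the contacts table, and the updated_at column are all hypothetical, and an in-memory SQLite database stands in for the RDBMS. It only illustrates the idea of keeping a checkpoint on a timestamp column and pulling newer rows at a fixed interval.

```python
import sqlite3
import time


def fetch_increment(conn, table, ts_column, checkpoint):
    """Return rows whose ts_column is greater than checkpoint, plus the new checkpoint."""
    # Identifiers cannot be bound as parameters, so table and ts_column
    # are assumed to come from trusted configuration.
    query = f"SELECT * FROM {table} WHERE {ts_column} > ? ORDER BY {ts_column}"
    cur = conn.execute(query, (checkpoint,))
    columns = [d[0] for d in cur.description]
    rows = [dict(zip(columns, r)) for r in cur.fetchall()]
    # Advance the checkpoint to the largest timestamp seen, if any rows came back.
    new_checkpoint = rows[-1][ts_column] if rows else checkpoint
    return rows, new_checkpoint


def poll(conn, table, ts_column, checkpoint, interval_seconds, rounds, sink):
    """Run a fixed number of polling rounds, handing each non-empty batch to sink."""
    for i in range(rounds):
        rows, checkpoint = fetch_increment(conn, table, ts_column, checkpoint)
        if rows:
            sink(rows)  # hypothetical stand-in for the upsert into the target table
        if i < rounds - 1:
            time.sleep(interval_seconds)
    return checkpoint


def demo():
    """Pull once, change the source, pull again; return the batches and final checkpoint."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE contacts (contact_id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)"
    )
    conn.executemany(
        "INSERT INTO contacts VALUES (?, ?, ?)", [(1, "a", 100), (2, "b", 200)]
    )
    batches = []
    checkpoint = poll(conn, "contacts", "updated_at", 0, 0, 1, batches.append)
    # Simulate one update and one insert arriving after the first pull.
    conn.execute("UPDATE contacts SET name = 'a2', updated_at = 300 WHERE contact_id = 1")
    conn.execute("INSERT INTO contacts VALUES (3, 'c', 400)")
    checkpoint = poll(conn, "contacts", "updated_at", checkpoint, 0, 1, batches.append)
    return batches, checkpoint
```

One design caveat with this sketch: the strict "greater than" comparison can miss a row that is committed later with the same timestamp as the checkpoint, so a real source would likely need either a strictly increasing column or some overlap handling.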
