taherk77 commented on issue #917: [HUDI-251] JDBC incremental load to HUDI with DeltaStreamer
URL: https://github.com/apache/incubator-hudi/pull/917#issuecomment-534626027

> this can be implemented by passing `--checkpoint null` or `--full-load` flag?

This is actually a general issue for all sources. It would be good to open a new JIRA for this and tackle it separately. For example, even if you have files on DFS, you probably want an option to do this. For this PR, we can just focus on incremental pulling, where the first run without a checkpoint pulls the entire table?

So here is the kind of algorithm I am thinking of implementing, with load params (`incremental_column` and interval `x`):

1. On the first run, do a full pull of the table and write `max(incremental_column)` as `last_val` to the checkpoint.
2. On the next scheduled run after interval `x`, use Spark JDBC predicate pushdown to run `select * from table where incremental_column > last_val`, write the data from this RDD, then write `max(incremental_column)` as the checkpoint again, and keep going.

> > > the interval we should be pulling the data every interval.
> >
> > On the interval, maybe I was vague, apologies. What I meant was that the frequency at which we run DeltaStreamer is controlled by the user in non-continuous mode, and #921 just added a flag to control this in continuous mode. Don't think we need to worry about it in this PR?

By interval I meant a thread that keeps running until killed and keeps scheduling the JDBC job after the mentioned interval. I had a concern here: say a JDBC job takes 10 minutes to complete, but the user specifies a 5-minute interval; jobs will keep piling up, right?

Last concern: whenever we write JDBC jobs in Spark, standard practice is `spark.read().jdbc("url", "table", "someTableColumn", 1, 10, numPartitions, connectionProps)`, where 1 is the lowerBound and 10 is the upperBound. With these, each Spark executor runs a range query and pulls its slice of the data independently.
Without the lowerBound and upperBound, all the data is pulled by a single executor, which makes this process really slow. So how do we incorporate this in the code? Even if we call `repartition` on the result of `spark.read().jdbc`, all the data still flows through one executor first and only then gets repartitioned.
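The checkpointing loop in the numbered steps above can be sketched without Spark at all. Below is a minimal stand-in using Python's built-in `sqlite3` in place of a real JDBC source; the table, column names, and data are hypothetical, and the real DeltaStreamer would persist `last_val` in the Hudi commit metadata rather than a local variable:

```python
import sqlite3

def incremental_pull(conn, table, incremental_column, last_val):
    """Pull rows newer than the checkpoint; return (rows, new_checkpoint)."""
    cur = conn.execute(
        f"SELECT * FROM {table} WHERE {incremental_column} > ?", (last_val,)
    )
    rows = cur.fetchall()
    # The new checkpoint is max(incremental_column); keep the old
    # checkpoint if this pull returned nothing.
    new_val = conn.execute(
        f"SELECT COALESCE(MAX({incremental_column}), ?) FROM {table}",
        (last_val,),
    ).fetchone()[0]
    return rows, new_val

# Demo with an in-memory table (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, updated_at INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)", [(1, 10), (2, 20)])

# First run: a checkpoint below all values pulls the whole table.
rows, ckpt = incremental_pull(conn, "events", "updated_at", -1)
print(len(rows), ckpt)   # 2 20

# New data arrives; the next run pulls only rows past the checkpoint.
conn.execute("INSERT INTO events VALUES (3, 30)")
rows, ckpt = incremental_pull(conn, "events", "updated_at", ckpt)
print(len(rows), ckpt)   # 1 30
```

The second call illustrates why writing `max(incremental_column)` after every pull matters: the next run's predicate pushdown only touches rows created since the previous checkpoint.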
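The parallelism that lowerBound/upperBound buys boils down to each executor issuing its own bounded query. The sketch below approximates how Spark's JDBC source splits `[lowerBound, upperBound]` into per-partition predicates (the first and last partitions are left unbounded so rows outside the bounds aren't silently dropped); `sqlite3` again stands in for the JDBC source, and the exact predicate shape is an approximation of Spark's behavior, not a quote of its code:

```python
import sqlite3

def partition_predicates(column, lower, upper, num_partitions):
    """Split [lower, upper] into num_partitions range predicates,
    one per would-be executor."""
    stride = (upper - lower) // num_partitions
    preds = []
    for i in range(num_partitions):
        lo, hi = lower + i * stride, lower + (i + 1) * stride
        if i == 0:
            preds.append(f"{column} < {hi} OR {column} IS NULL")
        elif i == num_partitions - 1:
            preds.append(f"{column} >= {lo}")
        else:
            preds.append(f"{column} >= {lo} AND {column} < {hi}")
    return preds

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(1, 11)])

# In Spark, each predicate would run as an independent range query
# on a separate executor; here we just run them in sequence.
counts = []
for pred in partition_predicates("id", 1, 10, 3):
    n = len(conn.execute(f"SELECT * FROM t WHERE {pred}").fetchall())
    counts.append(n)
    print(pred, "->", n, "rows")   # partitions cover all 10 rows, no overlap
```

This is why omitting the bounds is so costly: with no partition column there is a single predicate-free query, so one executor fetches everything, and a later `repartition` can't undo that serial fetch.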
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

With regards,
Apache Git Services