taherk77 commented on issue #917: [HUDI-251] JDBC incremental load to HUDI with DeltaStreamer
URL: https://github.com/apache/incubator-hudi/pull/917#issuecomment-534626027
 
 
   
   > this can be implemented by passing `--checkpoint null` or `--full-load` 
flag? This is actually a general issue for all sources.. it would be good to 
open a new JIRA for this and tackle separately.. For e.g, even if you have 
files on DFS, you want to probably have an option to do this.. For this PR, we 
can just focus on incremental pulling where the first run without a checkpoint, 
pulls the entire table?
   
   So here is the kind of algorithm I am thinking of implementing,
    
   with load params (an `incremental_column` and an interval `x`):
   
   1. On the first run, do a full pull of the table, then write `max(incremental_column)` as `last_val` to the checkpoint.
   2. On each subsequent run, scheduled after interval `x`, use Spark JDBC predicate pushdown to do `select * from table where incremental_column > last_val`, write that data out, then write `max(incremental_column)` as the new checkpoint, and keep going (see the sketch after this list).
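   
   To make the checkpoint handoff concrete, here is a minimal sketch in Spark's Java API. `readCheckpoint`/`writeCheckpoint` are hypothetical helpers standing in for however DeltaStreamer actually stores the checkpoint, and the sketch assumes the incremental column's values compare correctly as strings:
   
   ```java
   import static org.apache.spark.sql.functions.*;
   
   // lastVal comes from the stored checkpoint; null means this is the first run.
   String lastVal = readCheckpoint(); // hypothetical helper
   Dataset<Row> df = spark.read().jdbc(url, table, connectionProps);
   if (lastVal != null) {
     // The filter is pushed down by the JDBC source, effectively running
     // "select * from table where incremental_column > last_val" in the database.
     df = df.where(col(incrementalColumn).gt(lit(lastVal)));
   }
   // ... write df out here ...
   // Then advance the checkpoint to max(incremental_column) for the next run.
   Row maxRow = df.agg(max(col(incrementalColumn))).first();
   if (!maxRow.isNullAt(0)) {
     writeCheckpoint(maxRow.get(0).toString()); // hypothetical helper
   }
   ```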
   
   
    
   
   > 
   > > the interval we should be pulling the data every interval.
   > 
   > On the interval maybe I was vague. apologies. What I meant was the 
frequency at which we run DeltaStreamer is controlled by the user in 
non-continuous mode and #921 just added a flag to control this in continuous 
mode. Don't think we need to worry about it in this PR?
   
   On interval, what I meant was a thread that keeps running until killed and keeps scheduling the JDBC job after the mentioned interval. I had a concern here: say a JDBC job takes 10 minutes to complete but the user specifies a 5-minute interval; that means jobs will keep piling up, right?
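   
   One way to avoid the pile-up, assuming we own the scheduling loop (this is just a sketch, not what DeltaStreamer does today), is `scheduleWithFixedDelay`, which only starts counting the interval after the previous run finishes, so a 10-minute job with a 5-minute interval runs back-to-back instead of overlapping:
   
   ```java
   import java.util.concurrent.Executors;
   import java.util.concurrent.ScheduledExecutorService;
   import java.util.concurrent.TimeUnit;
   
   ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
   // The next pull is scheduled 5 minutes after the previous pull *completes*,
   // so long-running jobs cannot stack up. runJdbcPull() is a placeholder.
   scheduler.scheduleWithFixedDelay(() -> runJdbcPull(), 0, 5, TimeUnit.MINUTES);
   ```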
   
   Last concern: every time we write JDBC jobs in Spark, the standard practice is
   
   `spark.read().jdbc(url, "table", "someTableColumn", 1, 10, numPartitions, connectionProps)`, where 1 is the lower bound and 10 is the upper bound. With this, each Spark executor does a range query on `someTableColumn` and pulls its slice of the data independently. Without the lower bound and upper bound, all the data will be pulled through one executor, which will make this process really slow. So how do we incorporate this in the code?
   
   Even if we call repartition on the result of spark.read.jdbc, all the data still comes through one executor first and only then gets repartitioned from there.
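   
   One possible way to wire this in, sketched under the assumption that the partition column is numeric (the table name, column name, and `numPartitions` are placeholders): push a min/max subquery down to the database to get the bounds, then hand them to the partitioned `jdbc()` overload so each executor issues its own range query:
   
   ```java
   // Push the min/max computation to the database via a subquery-as-table.
   String boundsQuery = "(select min(someTableColumn) lo, max(someTableColumn) hi from my_table) t";
   Row bounds = spark.read().jdbc(url, boundsQuery, connectionProps).first();
   long lower = bounds.getLong(0); // assumes the column maps to a long
   long upper = bounds.getLong(1);
   
   // Partitioned read: Spark splits [lower, upper] into numPartitions ranges and
   // each executor runs its own range query on someTableColumn.
   int numPartitions = 10; // placeholder; would come from config
   Dataset<Row> df = spark.read().jdbc(url, "my_table", "someTableColumn",
                                       lower, upper, numPartitions, connectionProps);
   ```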
   
   
   
