[ 
https://issues.apache.org/jira/browse/HUDI-251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17325958#comment-17325958
 ] 

Vinoth Chandar commented on HUDI-251:
-------------------------------------

On 2, I think we have to enforce some sorting when limiting (if you are 
pulling very incrementally, hopefully it won't be as bad). Since we persist 
the fetched rows, we can derive the checkpoint by picking the maximum value 
of the `ckpt` column each time, and we should be okay. 

>where ckpt > last_ckpt order by ckpt desc limit x

Yes, we are on the same page. We have to sort and paginate like this.
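
A minimal sketch of this sort-and-paginate loop, using Python's built-in 
`sqlite3` as a stand-in for the JDBC source. The table name `source_table`, 
the helper `pull_increment`, and the page size are hypothetical and are not 
taken from the DeltaStreamer code. The sketch orders ascending, which is an 
assumption on my part, so that a limited page never skips rows lying between 
the last checkpoint and the page's maximum `ckpt`. It also assumes `ckpt` 
values are unique; with ties at a page boundary, rows could be skipped.

```python
import sqlite3


def pull_increment(conn, last_ckpt, limit):
    """Fetch up to `limit` rows with ckpt > last_ckpt, oldest first.

    Returns the rows and the new checkpoint (max ckpt seen, or the old
    checkpoint if nothing new was found).
    """
    rows = conn.execute(
        "SELECT id, ckpt FROM source_table "
        "WHERE ckpt > ? ORDER BY ckpt ASC LIMIT ?",
        (last_ckpt, limit),
    ).fetchall()
    new_ckpt = max((row[1] for row in rows), default=last_ckpt)
    return rows, new_ckpt


# Hypothetical source table with seven rows and unique ckpt values.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source_table (id INTEGER PRIMARY KEY, ckpt INTEGER)")
conn.executemany(
    "INSERT INTO source_table (id, ckpt) VALUES (?, ?)",
    [(i, i) for i in range(1, 8)],
)

ckpt = 0      # last persisted checkpoint
pulled = []   # ids fetched so far, in fetch order
while True:
    rows, new_ckpt = pull_increment(conn, ckpt, limit=3)
    if not rows:
        break
    pulled.extend(row[0] for row in rows)
    ckpt = new_ckpt  # checkpoint advances to the max ckpt of this page
```

Each pass reads one page and moves the checkpoint to that page's largest 
`ckpt`, so the next pass starts right after it.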

>Can you please elaborate more on the tailing mechanism?

What I meant was that there could be scenarios where we still miss data with 
this JDBC-based approach. We should clearly document these.

For example, as we fetch `ckpt > 10`, there could be a long-running 
transaction that just committed an earlier `ckpt=8` value. We would just fetch 
all records after 10 and move on, so that record would never be picked up. 
Let's also think through other issues like this. I think it's okay, since 
everybody understands that JDBC pulling is more for convenience than anything 
and works correctly as long as you don't run into these cases. Does that make 
sense?
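
To make the missed-record case concrete, here is a small sketch, again using 
Python's `sqlite3` as a stand-in for the source database. The table 
`source_table` and the helper `pull_after` are hypothetical. The late commit 
is simulated by inserting the `ckpt=8` row only after the first pull has 
already moved the checkpoint to 10.

```python
import sqlite3


def pull_after(conn, last_ckpt):
    """Return all rows with ckpt strictly greater than last_ckpt."""
    return conn.execute(
        "SELECT id, ckpt FROM source_table WHERE ckpt > ? ORDER BY ckpt ASC",
        (last_ckpt,),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source_table (id INTEGER PRIMARY KEY, ckpt INTEGER)")
conn.executemany("INSERT INTO source_table VALUES (?, ?)", [(1, 9), (2, 10)])

# First pull sees ckpt 9 and 10, so the checkpoint becomes 10.
first = pull_after(conn, 0)
last_ckpt = max(c for _, c in first)

# Simulated long-running transaction: its row carries ckpt=8 but only
# becomes visible after the checkpoint has already advanced to 10.
conn.execute("INSERT INTO source_table VALUES (?, ?)", (3, 8))
# A normal new row arriving after the first pull.
conn.execute("INSERT INTO source_table VALUES (?, ?)", (4, 11))

# Second pull only asks for ckpt > 10, so the ckpt=8 row is never fetched.
second = pull_after(conn, last_ckpt)

seen = {row_id for row_id, _ in first + second}
all_ids = [r[0] for r in conn.execute("SELECT id FROM source_table ORDER BY id")]
missed = [row_id for row_id in all_ids if row_id not in seen]
```

Here `missed` ends up holding the id of the late-committed row, which is the 
kind of gap the documentation should call out.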

> JDBC incremental load to HUDI with DeltaStreamer
> ------------------------------------------------
>
>                 Key: HUDI-251
>                 URL: https://issues.apache.org/jira/browse/HUDI-251
>             Project: Apache Hudi
>          Issue Type: New Feature
>          Components: DeltaStreamer
>    Affects Versions: 0.9.0
>            Reporter: Taher Koitawala
>            Assignee: Purushotham Pushpavanthar
>            Priority: Trivial
>              Labels: pull-request-available
>             Fix For: 0.9.0
>
>          Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Mirroring RDBMS to HUDI is one of the most basic use cases of HUDI. Hence, 
> for such use cases, DeltaStreamer should provide inbuilt support.
> DeltaStreamer should accept something like jdbc-source.properties where users 
> can define the RDBMS connection properties along with a timestamp column and 
> an interval which allows users to express how frequently HUDI should check 
> with RDBMS data source for new inserts or updates.
> Details are documented in RFC-14
> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+14+%3A+JDBC+incremental+puller



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
