[
https://issues.apache.org/jira/browse/HUDI-251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17325156#comment-17325156
]
Sagar Sumit commented on HUDI-251:
----------------------------------
# Yes. The sequence is: fetch() -> persist() -> checkpoint() -> form the pair
of the persisted dataset and the checkpoint -> unpersist() -> return the pair.
# That's a valid point. What if the number of records since the last
checkpoint is greater than sourceLimit? Even if we order by the checkpoint
column, we will miss some records. That means we need some sort of pagination
on top of sorting (running multiple queries of the form select from ... where
ckpt > last_ckpt order by ckpt desc limit x). Won't this be costlier than a
single select query without a limit?
# Can you please elaborate on the tailing mechanism? Is it related to the
pagination point I mentioned above?
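
To make the pagination concern in point 2 concrete, here is a minimal sketch of
paging past a checkpoint in chunks of sourceLimit. It is not the DeltaStreamer
implementation. The table `src`, its columns, and the helpers `fetch_page`,
`pull_all`, and `make_demo_db` are hypothetical, and an in-memory SQLite
database stands in for the RDBMS. The sketch orders ascending so that each page
resumes where the previous one stopped, and it assumes a unique `id` column to
break ties between rows that share a checkpoint value.

```python
import sqlite3


def fetch_page(conn, last_ckpt, last_id, limit):
    """Return up to `limit` rows strictly after (last_ckpt, last_id), oldest first."""
    return conn.execute(
        "SELECT id, ckpt, payload FROM src "
        "WHERE ckpt > ? OR (ckpt = ? AND id > ?) "
        "ORDER BY ckpt ASC, id ASC LIMIT ?",
        (last_ckpt, last_ckpt, last_id, limit),
    ).fetchall()


def pull_all(conn, last_ckpt, last_id, source_limit):
    """Page through everything newer than the checkpoint, one query per page."""
    rows = []
    while True:
        page = fetch_page(conn, last_ckpt, last_id, source_limit)
        if not page:
            break
        rows.extend(page)
        # Advance the checkpoint to the last row of the page just read.
        last_id, last_ckpt = page[-1][0], page[-1][1]
    return rows, (last_ckpt, last_id)


def make_demo_db():
    """Build a hypothetical source table in which some rows share a ckpt value."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE src (id INTEGER PRIMARY KEY, ckpt INTEGER, payload TEXT)"
    )
    conn.executemany(
        "INSERT INTO src VALUES (?, ?, ?)",
        [(i, 100 + i // 2, f"row-{i}") for i in range(1, 8)],
    )
    return conn


if __name__ == "__main__":
    conn = make_demo_db()
    rows, checkpoint = pull_all(conn, last_ckpt=0, last_id=0, source_limit=3)
    print(len(rows), checkpoint)
```

Each page costs one extra round trip, which is the overhead the question above
is about. Without the `id` tie-breaker, a page boundary that falls between two
rows with the same `ckpt` would skip the second row on the next query.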
> JDBC incremental load to HUDI with DeltaStreamer
> ------------------------------------------------
>
> Key: HUDI-251
> URL: https://issues.apache.org/jira/browse/HUDI-251
> Project: Apache Hudi
> Issue Type: New Feature
> Components: DeltaStreamer
> Affects Versions: 0.9.0
> Reporter: Taher Koitawala
> Assignee: Purushotham Pushpavanthar
> Priority: Trivial
> Labels: pull-request-available
> Fix For: 0.9.0
>
> Time Spent: 1h 40m
> Remaining Estimate: 0h
>
> Mirroring RDBMS to HUDI is one of the most basic use cases of HUDI. Hence,
> for such use cases, DeltaStreamer should provide inbuilt support.
> DeltaStreamer should accept something like jdbc-source.properties where users
> can define the RDBMS connection properties along with a timestamp column and
> an interval which allows users to express how frequently HUDI should check
> with RDBMS data source for new inserts or updates.
> Details are documented in RFC-14
> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+14+%3A+JDBC+incremental+puller
--
This message was sent by Atlassian Jira
(v8.3.4#803005)