Re: Last commit id/ts checkpoint for incremental pull

Vinoth Chandar Wed, 08 May 2019 15:09:37 -0700

sg. please keep us posted.

On Wed, May 8, 2019 at 12:02 AM Roshan Nair (Data Platform)
<[email protected]> wrote:


> Vinoth,
>
> Thanks. We are evaluating hudi at the moment for a very specific use case.
>
> We are also looking at Hive 3.0, but I still don't see a way to do
> incremental pulls on it. We feel it might be possible to identify the new
> commits using some of the internal APIs, though, and we are checking that.
>
> We also came across Databricks Delta, and it seems to be conceptually
> similar to Hudi, though their storage format is not yet documented and
> generally internals documentation is lacking.
>
> We would be very much interested in Hudi for time travel capabilities as
> well, such as for building historical ml training data sets.
>
> Roshan
>
>
> On Tue, May 7, 2019 at 9:16 PM Vinoth Chandar <[email protected]> wrote:
>
> > Hi Roshan,
> >
> > Thanks for writing. Yes, the user needs to manage the _commit_time
> > watermark on the HiveIncrementalPuller path. Also, you need to set the
> > table in incremental mode, providing a start commit_time and max_commits
> > to pull, as documented. The DeltaStreamer tool will manage it for you
> > automatically, but it supports SparkSQL.
> >
> > At Uber, we have built some custom (yet simple) tools to do these steps
> > in your workflow scheduler.
> >
> > For example, let's say your commit timeline has commits c1, c2, c3 now
> > and you are at time t=0 (t corresponding to the commit timestamp):
> >
> > 1) Use HoodieTableMetaClient and obtain the source table's commit
> >    timeline and determine the range of commits to pull after t=0,
> >    i.e. c1, c2, c3
> > 2) Ask HiveIncrementalPuller to pull 3 commits from commit time=0
> > 3) Save c3 somewhere (mysql table or a folder on dfs)
> > 4) Before the next run, say there are new commits c4, c5. We make t=3 and
> >    end up pulling 2 commits from c3 as above.
> >
> > We'd love to work with you, if you are interested in standardizing this
> > flow inside Hudi itself. :)
> >
> >
> >
> >
> > On Mon, May 6, 2019 at 11:50 PM Roshan Nair (Data Platform)
> > <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > We are trying to work out how to use hudi for incremental pulls. In our
> > > scenario, we would like to read from a hudi table incrementally, so
> > > that every subsequent read only reads new data.
> > >
> > > In the incremental hiveql example in the quickstart (
> > > http://hudi.incubator.apache.org/quickstart.html#incremental-hiveql),
> > > it appears that I can filter on _hoodie_commit_time to select only those
> > > records that have not been processed yet. Hudi will ensure snapshot
> > > isolation, so no new partial writes are visible to this reader.
> > >
> > > The next time I want an incremental set, how do I set the
> > > _hoodie_commit_time in the query?
> > >
> > > Is the expectation that the user will identify the max
> > > _hoodie_commit_time in the result of the query and then use this to set
> > > the _hoodie_commit_time filter for the next incremental query?
> > >
> > > Roshan
> > >
> >
>
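
The four-step checkpoint flow described in the thread above (read the commit timeline, pull the commits after the last watermark, save the newest commit, repeat) can be sketched roughly as follows. This is only an illustration of the bookkeeping, not Hudi code: plan_incremental_pull, load_checkpoint, save_checkpoint and the JSON file layout are hypothetical names made up for this example, and the timeline is assumed to be a plain list of commit-time strings that sort lexicographically (obtaining it from HoodieTableMetaClient and invoking HiveIncrementalPuller are left out).

```python
import json
from pathlib import Path
from typing import List, Optional, Tuple


def plan_incremental_pull(
    timeline: List[str], checkpoint: Optional[str]
) -> Tuple[str, int, Optional[str]]:
    """Return (start_commit_time, max_commits, new_checkpoint).

    timeline   -- commit times of the source table (assumed sortable strings)
    checkpoint -- last commit time already consumed, or None on the first run
    """
    # Assumption: "0" sorts before every real commit time, mirroring t=0.
    start = checkpoint if checkpoint is not None else "0"
    pending = [c for c in sorted(timeline) if c > start]
    # If nothing new arrived, the checkpoint stays where it was.
    new_checkpoint = pending[-1] if pending else checkpoint
    return start, len(pending), new_checkpoint


def load_checkpoint(path: Path) -> Optional[str]:
    """Read the saved watermark (hypothetical JSON file), or None if absent."""
    if not path.exists():
        return None
    return json.loads(path.read_text())["last_commit_time"]


def save_checkpoint(path: Path, commit_time: str) -> None:
    """Persist the newest consumed commit time for the next run."""
    path.write_text(json.dumps({"last_commit_time": commit_time}))
```

On a first run with commits c1, c2, c3 and no saved checkpoint, the plan is "pull 3 commits starting from 0" and c3 becomes the new checkpoint; once c4 and c5 arrive, the next plan is "pull 2 commits starting from c3". The checkpoint should only be saved after the pull itself has succeeded, so a failed run is retried from the same watermark.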
