Vinoth,
This is related to the difference between read-optimized and
write-optimized views.
> 1) Use HoodieTableMetaClient and obtain the source table's commit timeline
> and determine the range of commits to pull after t=0,
> i.e. c1, c2, c3
> 2) Ask HiveIncrementalPuller to pull 3 commits from commit time=0
Say we are running a COW (copy-on-write) table and:
a. Row 1 was updated in c3.
b. A reader executes step 1.
c. A writer updates row 1 in commit 4 (c4).
d. The reader proceeds to step 2.
Then my understanding is that the reader would not receive row 1 in the
result from HiveIncrementalPuller for commits c1-c3. The reader would
get the value of row 1 as of c4 on the next read (provided row 1 was not
updated again in the meantime). This should not usually be a problem, and I
assume it would happen infrequently, as the reader would typically not wait
before actually executing the read.
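To make the COW case concrete, here is a minimal Python sketch of my
understanding. It is a toy model, not Hudi code: the table is reduced to the
latest version of each row, tagged with the commit that last wrote it, and
incremental_pull is a hypothetical stand-in for what HiveIncrementalPuller
would return for a commit range. The row values are made up.

```python
# Toy model only (an assumption, not Hudi internals): a COW table is reduced
# to "latest version of each row", each tagged with its last-writing commit.
def incremental_pull(snapshot, start_commit, end_commit):
    """Return rows whose last-writing commit is in (start_commit, end_commit]."""
    return {
        key: row
        for key, row in snapshot.items()
        if start_commit < row["commit_time"] <= end_commit
    }

# State after c1..c3: row1 was last updated in c3, row2 in c1.
snapshot = {
    "row1": {"value": "v3", "commit_time": 3},
    "row2": {"value": "a", "commit_time": 1},
}

# Step 1: the reader inspects the timeline and fixes the range (0, 3].
start_commit, end_commit = 0, 3

# Between step 1 and step 2, a writer commits c4 and rewrites row1.
snapshot["row1"] = {"value": "v4", "commit_time": 4}

# Step 2: the reader pulls c1..c3; row1 now carries commit time 4, so it is
# filtered out of this pull.
first_pull = incremental_pull(snapshot, start_commit, end_commit)

# Next run: the reader pulls (3, 4] and sees row1 as of c4.
next_pull = incremental_pull(snapshot, 3, 4)

print(sorted(first_pull))  # row keys returned by the first pull
print(next_pull)
```

In this model the c3 value of row 1 is never delivered to the reader; only
the c4 value is, on the following pull.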
However, if we were running a MOR (merge-on-read) table (and provided the
compaction job has not run in between step 1 and step 2), we would receive
the value of row 1 as of c3.
Is this correct?
Roshan
On Thu, May 9, 2019 at 3:39 AM Vinoth Chandar <[email protected]> wrote:
> sg. please keep us posted.
>
> On Wed, May 8, 2019 at 12:02 AM Roshan Nair (Data Platform)
> <[email protected]> wrote:
>
> > Vinoth,
> >
> > Thanks. We are evaluating Hudi at the moment for a very specific use case.
> >
> > We are also looking at Hive 3.0, but I still don't see a way to do
> > incremental pulls on it. However, we feel it might be possible to identify
> > the new commits using some of the internal APIs, and we are checking that.
> >
> > We also came across Databricks Delta, and it seems to be conceptually
> > similar to Hudi, though their storage format is not yet documented and
> > generally internals documentation is lacking.
> >
> > We would also be very interested in Hudi for its time travel
> > capabilities, such as for building historical ML training data sets.
> >
> > Roshan
> >
> >
> > On Tue, May 7, 2019 at 9:16 PM Vinoth Chandar <[email protected]> wrote:
> >
> > > Hi Roshan,
> > >
> > > Thanks for writing. Yes, the user needs to manage the _commit_time
> > > watermark on the HiveIncrementalPuller path. Also, you need to set the
> > > table in incremental mode, providing a start commit_time and max_commits
> > > to pull, as documented. The DeltaStreamer tool will manage it for you
> > > automatically, but it supports SparkSQL.
> > >
> > > At Uber, we have built some custom (yet simple) tools to do these steps
> > > in your workflow scheduler.
> > >
> > > For example, let's say your commit timeline has commits c1, c2, c3 now
> > > and you are at time t=0 (t corresponding to commit timestamp).
> > >
> > > 1) Use HoodieTableMetaClient and obtain the source table's commit
> > > timeline and determine the range of commits to pull after t=0,
> > > i.e. c1, c2, c3
> > > 2) Ask HiveIncrementalPuller to pull 3 commits from commit time=0
> > > 3) Save c3 somewhere (mysql table or a folder on dfs)
> > > 4) Before the next run, say there are new commits c4, c5. We make t=3
> > > and end up pulling 2 commits from c3 as above.
> > >
> > > We'd love to work with you, if you are interested in standardizing this
> > > flow inside Hudi itself. :)
> > >
> > >
> > >
> > >
> > > On Mon, May 6, 2019 at 11:50 PM Roshan Nair (Data Platform)
> > > <[email protected]> wrote:
> > >
> > > > Hi,
> > > >
> > > > We are trying to work out how to use Hudi for incremental pulls. In
> > > > our scenario, we would like to read from a Hudi table incrementally,
> > > > so that every subsequent read only reads new data.
> > > >
> > > > In the incremental HiveQL example in the quickstart
> > > > (http://hudi.incubator.apache.org/quickstart.html#incremental-hiveql),
> > > > it appears that I can filter on _hoodie_commit_time to select only
> > > > those records that have not been processed yet. Hudi will ensure
> > > > snapshot isolation, so no new partial writes are visible to this reader.
> > > >
> > > > The next time I want an incremental set, how do I set the
> > > > _hoodie_commit_time in the query?
> > > >
> > > > Is the expectation that the user will identify the max
> > > > _hoodie_commit_time in the result of the query and then use this to set
> > > > the _hoodie_commit_time filter for the next incremental query?
> > > >
> > > > Roshan
> > > >
> > >
> >
>
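
For reference, the watermark bookkeeping in steps 1-4 of Vinoth's message
can be sketched in Python roughly as follows. This is only an illustration
under stated assumptions: plan_pull, load_watermark and save_watermark are
hypothetical helpers, the commit timeline is a plain list of integers rather
than something read via HoodieTableMetaClient, the watermark is kept in a
local JSON file instead of a mysql table or dfs folder, and the actual
HiveIncrementalPuller invocation is left as a comment.

```python
import json
import os
import tempfile

def plan_pull(timeline, watermark):
    """Return (start_commit, max_commits, new_watermark) for the next pull."""
    pending = sorted(c for c in timeline if c > watermark)
    if not pending:
        return watermark, 0, watermark
    return watermark, len(pending), pending[-1]

def load_watermark(path, default=0):
    """Read the last saved commit time, or `default` on the very first run."""
    if not os.path.exists(path):
        return default
    with open(path) as f:
        return json.load(f)["watermark"]

def save_watermark(path, watermark):
    """Persist the last pulled commit time (step 3)."""
    with open(path, "w") as f:
        json.dump({"watermark": watermark}, f)

# Hypothetical location for the saved watermark.
watermark_file = os.path.join(tempfile.mkdtemp(), "watermark.json")

# First run: timeline has c1, c2, c3 and we start from t=0.
timeline = [1, 2, 3]
start, max_commits, new_mark = plan_pull(timeline, load_watermark(watermark_file))
# A real job would pass `start` and `max_commits` to the incremental pull here.
save_watermark(watermark_file, new_mark)
first_run = (start, max_commits, new_mark)

# Next run: c4 and c5 have arrived, so we pull 2 commits starting from c3.
timeline = [1, 2, 3, 4, 5]
start, max_commits, new_mark = plan_pull(timeline, load_watermark(watermark_file))
save_watermark(watermark_file, new_mark)
second_run = (start, max_commits, new_mark)

print(first_run, second_run)
```

Saving the watermark only after the pull has succeeded means a failed run
simply repeats the same range on the next attempt.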