sg. please keep us posted.

On Wed, May 8, 2019 at 12:02 AM Roshan Nair (Data Platform) <[email protected]> wrote:
> Vinoth,
>
> Thanks. We are evaluating hudi at the moment for a very specific use case.
>
> We are also looking at hive 3.0, but I still don't see a way to do
> incremental pulls on it. Though, we feel it might be possible to identify
> the new commits using some of the internal apis, and we are checking that.
>
> We also came across Databricks Delta, and it seems to be conceptually
> similar to Hudi, though their storage format is not yet documented and
> generally internals documentation is lacking.
>
> We would be very much interested in Hudi for time travel capabilities as
> well, such as for building historical ml training data sets.
>
> Roshan
>
>
> On Tue, May 7, 2019 at 9:16 PM Vinoth Chandar <[email protected]> wrote:
>
> > Hi Roshan,
> >
> > Thanks for writing. Yes, the user needs to manage the _commit_time
> > watermark on the HiveIncrementalPuller path. Also you need to set the
> > table in incremental mode, providing a start commit_time and max_commits
> > to pull as documented. The DeltaStreamer tool will manage it for you
> > automatically, but it supports SparkSQL.
> >
> > At Uber, we have built some custom (yet simple) tools to do these steps
> > in your workflow scheduler.
> >
> > For e.g, let's say your commit timeline has c1, c2, c3 commits now and
> > you are at time t=0 (t corresponding to commit timestamp)
> >
> > 1) Use HoodieTableMetaClient and obtain the source table's commit
> >    timeline and determine the range of commits to pull after t=0,
> >    i.e c1, c2, c3
> > 2) Ask HiveIncrementalPuller to pull 3 commits from commit time=0
> > 3) Save c3 somewhere (mysql table or a folder on dfs)
> > 4) Before the next run, say there are new commits c4, c5. We make t=3
> >    and end up pulling 2 commits from c3 as above.
> >
> > We'd love to work with you, if you are interested in standardizing this
> > flow inside Hudi itself. :)
> >
> >
> > On Mon, May 6, 2019 at 11:50 PM Roshan Nair (Data Platform)
> > <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > We are trying to work out how to use hudi for incremental pulls. In our
> > > scenario, we would like to read from a hudi table incrementally, so
> > > that every subsequent read only reads new data.
> > >
> > > In the incremental hiveql example in the quickstart (
> > > http://hudi.incubator.apache.org/quickstart.html#incremental-hiveql),
> > > it appears that I can filter on _hoodie_commit_time to select only
> > > those records that have not been processed yet. Hudi will ensure
> > > snapshot isolation, so no new partial writes are visible to this reader.
> > >
> > > The next time I want an incremental set, how do I set the
> > > _hoodie_commit_time in the query?
> > >
> > > Is the expectation that the user will identify the max
> > > _hoodie_commit_time in the result of the query and then use this to
> > > set the _hoodie_commit_time filter for the next incremental query?
> > >
> > > Roshan
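
For readers following the thread, the four watermark steps Vinoth describes can be sketched roughly as below. This is only an illustration under stated assumptions, not Hudi code: the commit timeline is passed in as a plain list of commit-time strings (standing in for whatever HoodieTableMetaClient returns), `pull` is a hypothetical callback standing in for a HiveIncrementalPuller invocation, and the watermark is kept in a local file rather than a mysql table or a folder on dfs. It also assumes commit times are fixed-width timestamp strings, so that string comparison matches time order.

```python
from pathlib import Path
from typing import Callable, List, Optional


def commits_to_pull(timeline: List[str], watermark: str,
                    max_commits: Optional[int] = None) -> List[str]:
    """Step 1: commits strictly after the watermark, oldest first."""
    newer = sorted(c for c in timeline if c > watermark)
    return newer if max_commits is None else newer[:max_commits]


def load_watermark(path: Path, default: str = "0") -> str:
    """Read the last saved commit time, or the default (t=0) on the first run."""
    return path.read_text().strip() if path.exists() else default


def save_watermark(path: Path, commit_time: str) -> None:
    """Step 3: persist the last commit that was pulled."""
    path.write_text(commit_time)


def run_incremental_pull(timeline: List[str], watermark_path: Path,
                         pull: Callable[[str, int], None],
                         max_commits: Optional[int] = None) -> List[str]:
    """One scheduler run: find new commits, pull them, advance the watermark."""
    start = load_watermark(watermark_path)
    pending = commits_to_pull(timeline, start, max_commits)
    if pending:
        # Step 2: hypothetical stand-in for "pull N commits from commit time=start".
        pull(start, len(pending))
        # Only advance the watermark after the pull call has returned.
        save_watermark(watermark_path, pending[-1])
    return pending
```

On the next run (step 4), the saved watermark is c3, so only c4 and c5 are selected. Saving the watermark after the pull rather than before means a failed run is retried from the same starting point.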
