n3nash commented on issue #3054:
URL: https://github.com/apache/hudi/issues/3054#issuecomment-865424978
@FelixKJose You can do time travel in the following way:
**Using Spark**
```
Dataset<Row> hudiIncQueryDF = spark.read()
    .format("org.apache.hudi")
    .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY(),
        DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL())
    .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY(), <beginInstantTime>)
    .option(DataSourceReadOptions.END_INSTANTTIME_OPT_KEY(), <endInstantTime>)
    // Optional: use a glob pattern if querying only certain partitions
    .option(DataSourceReadOptions.INCR_PATH_GLOB_OPT_KEY(), "/year=2020/month=*/day=*")
    // For an incremental query, pass in the root/base path of the table
    .load(tablePath);

hudiIncQueryDF.createOrReplaceTempView("hudi_trips_incremental");
spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts "
    + "from hudi_trips_incremental where fare > 20.0").show();
```
**Using Hive**
```
hive_shell> set hoodie.source_table_name.consume.mode=incremental
hive_shell> set hoodie.source_table_name.consume.start.timestamp=<beginInstantTime>
-- Hive has no end-timestamp option; convert <endInstantTime> into a number of commits to read
hive_shell> set hoodie.source_table_name.consume.max.commits=5
hive_shell> select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts
            from source_table_name where fare > 20.0
```
Ideally, we should add a `hoodie.table_name.consume.end.timestamp` option to
support the same end-bound behavior in Hive.
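Since Hive only takes a commit count rather than an end timestamp, the conversion step hinted at above can be sketched as follows. This is an illustrative helper, not part of the Hudi API; the timeline list and instant values are made up for the example. Hudi instant times are `yyyyMMddHHmmss`-style strings, so plain string comparison matches chronological order.

```python
# Illustrative sketch (not Hudi API): given the completed commit instants from the
# table's timeline, count how many fall in (beginInstantTime, endInstantTime] so
# the result can be passed to hoodie.<table>.consume.max.commits.
def commits_between(timeline, begin_instant, end_instant):
    """Count commits strictly after begin_instant and up to end_instant."""
    return sum(1 for t in sorted(timeline) if begin_instant < t <= end_instant)

# Hypothetical timeline of completed commit times
timeline = ["20210601101500", "20210602093000", "20210603120000", "20210604080000"]
print(commits_between(timeline, "20210601101500", "20210603120000"))  # 2
```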
@fengjian428 For the incremental pull using Spark, `INCR_PATH_GLOB_OPT_KEY` is
only used to restrict which files are read when pulling data for a commit range;
it operates at the file level. If you want to query data within a commit range
based on other columns and then use that as an "incremental pull", yes, that is
where the data skipping index will be helpful.
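The file-level behavior described above can be sketched roughly as follows. This is an illustration of the concept, not Hudi internals: the commit range first selects which files changed, then the path glob filters those files by partition path; column-based filtering only happens later, when the selected files are actually scanned. All paths, commit times, and the glob here are made up.

```python
# Sketch: commit range + path glob both act on files, not on column values.
from fnmatch import fnmatch

# Hypothetical files touched by commits, with their commit instant times
files = [
    {"path": "/year=2020/month=01/day=05/f1.parquet", "commit": "20210601101500"},
    {"path": "/year=2019/month=12/day=31/f2.parquet", "commit": "20210602093000"},
    {"path": "/year=2020/month=02/day=10/f3.parquet", "commit": "20210604080000"},
]

begin, end = "20210601000000", "20210603120000"
glob = "/year=2020/month=*/day=*/*"  # analogous to INCR_PATH_GLOB_OPT_KEY

# Keep files whose commit falls in (begin, end] AND whose path matches the glob
selected = [
    f["path"] for f in files
    if begin < f["commit"] <= end and fnmatch(f["path"], glob)
]
print(selected)  # ['/year=2020/month=01/day=05/f1.parquet']
```

f2 is inside the commit range but excluded by the glob, and f3 matches the glob but falls outside the commit range, which is why column-level "incremental" filtering needs something extra like the data skipping index.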