n3nash commented on issue #3054:
URL: https://github.com/apache/hudi/issues/3054#issuecomment-865424978
@FelixKJose You can do time travel in the following way:
**Using Spark**
```
Dataset<Row> hudiIncQueryDF = spark.read()
    .format("org.apache.hudi")
    .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY(),
        DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL())
    .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY(), <beginInstantTime>)
    .option(DataSourceReadOptions.END_INSTANTTIME_OPT_KEY(), <endInstantTime>)
    // Optional: use a glob pattern if querying only certain partitions
    .option(DataSourceReadOptions.INCR_PATH_GLOB_OPT_KEY(), "/year=2020/month=*/day=*")
    // For an incremental query, pass in the root/base path of the table
    .load(tablePath);

hudiIncQueryDF.createOrReplaceTempView("hudi_trips_incremental");
spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts "
    + "from hudi_trips_incremental where fare > 20.0").show();
```
**Using Hive**
```
hive_shell> set hoodie.source_table_name.consume.mode=incremental
hive_shell> set hoodie.source_table_name.consume.start.timestamp=<beginInstantTime>
-- Hive has no end-timestamp option; convert <endInstantTime> into a number of commits to read
hive_shell> set hoodie.source_table_name.consume.max.commits=5
hive_shell> select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts
            from source_table_name where fare > 20.0
```
Ideally, we should add a `hoodie.table_name.consume.end.timestamp` option to
support the same end-bound behavior in Hive.
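Since Hive only takes a commit count rather than an end timestamp, the conversion step hinted at above can be sketched as follows. This is an illustrative helper, not part of the Hudi API; the timeline list and instant values are made up for the example. Hudi instant times are `yyyyMMddHHmmss`-style strings, so plain string comparison matches chronological order.

```python
# Illustrative sketch (not Hudi API): given the completed commit instants from the
# table's timeline, count how many fall in (beginInstantTime, endInstantTime] so
# the result can be passed to hoodie.<table>.consume.max.commits.
def commits_between(timeline, begin_instant, end_instant):
    """Count commits strictly after begin_instant and up to end_instant."""
    return sum(1 for t in sorted(timeline) if begin_instant < t <= end_instant)

# Hypothetical timeline of completed commit times
timeline = ["20210601101500", "20210602093000", "20210603120000", "20210604080000"]
print(commits_between(timeline, "20210601101500", "20210603120000"))  # 2
```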
@fengjian428 For the incremental pull using Spark, `INCR_PATH_GLOB_OPT_KEY` is
only used to restrict which files are read when pulling data for a commit range;
it operates at the file level. If you want to query data within a commit range
based on other columns and then use that as an "incremental pull", yes, that is
where the data skipping index will be helpful.
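The file-level behavior described above can be sketched roughly as follows. This is an illustration of the concept, not Hudi internals: the commit range first selects which files changed, then the path glob filters those files by partition path; column-based filtering only happens later, when the selected files are actually scanned. All paths, commit times, and the glob here are made up.

```python
# Sketch: commit range + path glob both act on files, not on column values.
from fnmatch import fnmatch

# Hypothetical files touched by commits, with their commit instant times
files = [
    {"path": "/year=2020/month=01/day=05/f1.parquet", "commit": "20210601101500"},
    {"path": "/year=2019/month=12/day=31/f2.parquet", "commit": "20210602093000"},
    {"path": "/year=2020/month=02/day=10/f3.parquet", "commit": "20210604080000"},
]

begin, end = "20210601000000", "20210603120000"
glob = "/year=2020/month=*/day=*/*"  # analogous to INCR_PATH_GLOB_OPT_KEY

# Keep files whose commit falls in (begin, end] AND whose path matches the glob
selected = [
    f["path"] for f in files
    if begin < f["commit"] <= end and fnmatch(f["path"], glob)
]
print(selected)  # ['/year=2020/month=01/day=05/f1.parquet']
```

f2 is inside the commit range but excluded by the glob, and f3 matches the glob but falls outside the commit range, which is why column-level "incremental" filtering needs something extra like the data skipping index.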