[
https://issues.apache.org/jira/browse/HUDI-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17441555#comment-17441555
]
Danny Chen commented on HUDI-2086:
----------------------------------
Fixed via master branch: a40ac62e0ce7ec807a00803d23ed223d7c607459
> redo the logical of mor_incremental_view for hive
> -------------------------------------------------
>
> Key: HUDI-2086
> URL: https://issues.apache.org/jira/browse/HUDI-2086
> Project: Apache Hudi
> Issue Type: Bug
> Components: Hive Integration
> Affects Versions: 0.9.0
> Environment: spark3.1.1
> hive3.1.1
> hadoop3.1.1
> os: suse
> Reporter: tao meng
> Assignee: tao meng
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.10.0
>
>
> Currently, there are some problems with the MOR incremental view for Hive.
> For example,
> 1) *Hudi cannot read the latest incremental data when it is stored only in log files*
> Consider this case: create a MOR table with bulk_insert, then upsert into the
> table. Now we want to query the latest incremental data via Hive/Spark SQL,
> but because the latest incremental data is stored only in log files, the query
> returns nothing.
> step1: prepare data
> val df = spark.sparkContext.parallelize(0 to 20, 2)
>   .map(x => testCase(x, x + "jack", Random.nextInt(2))).toDF()
>   .withColumn("col3", expr("keyid + 3000"))
>   .withColumn("p", lit(1))
> step2: do bulk_insert
> mergePartitionTable(df, 4, "default", "inc",
>   tableType = DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert")
> step3: do upsert
> mergePartitionTable(df, 4, "default", "inc",
>   tableType = DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "upsert")
> step4: check the latest commit time and run the incremental query
> spark.sql("set hoodie.inc.consume.mode=INCREMENTAL")
> spark.sql("set hoodie.inc.consume.max.commits=1")
> spark.sql("set hoodie.inc.consume.start.timestamp=20210628103935")
> spark.sql("select keyid, col3 from inc_rt where `_hoodie_commit_time` >
> '20210628103935' order by keyid").show(100, false)
> +-----+----+
> |keyid|col3|
> +-----+----+
> +-----+----+
>
> 2) *If we do insert_overwrite/insert_overwrite_table on a Hudi MOR table, the
> incremental query result is wrong when we query the data committed before the
> insert_overwrite/insert_overwrite_table*
> step1: do bulk_insert
> mergePartitionTable(df, 4, "default", "overInc",
>   tableType = DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert")
> Now the timeline contains:
> [20210628160614.deltacommit]
> step2: do insert_overwrite_table
> mergePartitionTable(df, 4, "default", "overInc",
>   tableType = DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "insert_overwrite_table")
> Now the timeline contains:
> [20210628160614.deltacommit, 20210628160923.replacecommit]
> step3: query the data before insert_overwrite_table
> spark.sql("set hoodie.overInc.consume.mode=INCREMENTAL")
> spark.sql("set hoodie.overInc.consume.max.commits=1")
> spark.sql("set hoodie.overInc.consume.start.timestamp=0")
> spark.sql("select keyid, col3 from overInc_rt where `_hoodie_commit_time` >
> '0' order by keyid").show(100, false)
> +-----+----+
> |keyid|col3|
> +-----+----+
> +-----+----+
>
> 3) *Hive/Presto/Flink cannot read file groups that contain only log files*
> When we use the HBase or in-memory index, a MOR table produces log files
> instead of parquet base files, but Hive/Presto currently cannot read such file
> groups because they contain only log files.
> *HUDI-2048* mentions this problem.
>
> However, when we use the Spark DataSource to execute an incremental query,
> none of the problems above occur. It is therefore necessary to align the logic
> of the MOR incremental view for Hive with the Spark DataSource logic. This
> issue reworks the logic of the MOR incremental view for Hive to solve the
> problems above and keep it consistent with the Spark DataSource.
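> For comparison, the Spark DataSource incremental query that works correctly
> looks roughly like this (a minimal sketch: `basePath` is a placeholder for the
> table's base path, and the option keys are the standard Hudi DataSource
> incremental-read options; running it requires a live Spark + Hudi setup):
>
> ```scala
> // Incremental read via the Spark DataSource path, which already handles
> // log-only file groups and replacecommits when building the file slices.
> val incDf = spark.read.format("hudi")
>   .option("hoodie.datasource.query.type", "incremental")
>   .option("hoodie.datasource.read.begin.instanttime", "20210628103935")
>   .load(basePath)
> incDf.select("keyid", "col3").orderBy("keyid").show(100, false)
> ```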
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)