stayrascal opened a new issue, #8038: URL: https://github.com/apache/hudi/issues/8038
**Describe the problem you faced**

Hi, I'm using Flink 1.16 to create some tables (COW & MOR) based on HoodieCatalog, write data to them, and then query the data from other engines. I'm confused about how to run incremental queries from Hive; I couldn't find any official documentation about this apart from the demo at https://github.com/apache/hudi/blob/master/docker/demo/hive-incremental-mor-rt.commands. I hit two problems:

- For a MOR table that has not had any compaction, a count aggregation on the RO table returns the expected result, but querying all fields returns an empty result.
- For a MOR table that has had some compactions, incremental queries do not seem to work.

**To Reproduce**

Steps to reproduce the behavior:

1. Flink: create the Hoodie catalog and tables, then insert data.

```
CREATE CATALOG hms_catalog WITH (
  'type' = 'hudi',
  'catalog.path' = 'hdfs://xxxxx-1:8020/xxxx/hive',
  'hive.conf.dir' = '/xxxxx/hive/conf/',
  'mode' = 'hms'
);

CREATE TABLE flink_hudi_mor_tbl (
  uuid VARCHAR(20) PRIMARY KEY NOT ENFORCED,
  name VARCHAR(10),
  age INT,
  ts TIMESTAMP(3),
  `partition` VARCHAR(20)
) PARTITIONED BY (`partition`) WITH (
  'connector' = 'hudi',
  'table.type' = 'MERGE_ON_READ',
  'hoodie.datasource.write.recordkey.field' = 'uuid',
  'precombine.field' = 'ts',
  'hive_sync.enabled' = 'true'
);

CREATE TABLE hms_catalog.hudi_hms_db.flink_hudi_mor_streaming_tbl (
  uuid VARCHAR(20) PRIMARY KEY NOT ENFORCED,
  name VARCHAR(10),
  age INT,
  ts TIMESTAMP(3),
  `partition` VARCHAR(20)
) PARTITIONED BY (`partition`) WITH (
  'connector' = 'hudi',
  'table.type' = 'MERGE_ON_READ',
  'hoodie.datasource.write.recordkey.field' = 'uuid',
  'precombine.field' = 'ts',
  'hive_sync.enabled' = 'true'
);

CREATE TABLE flink_hudi_cow_tbl (
  uuid VARCHAR(20) PRIMARY KEY NOT ENFORCED,
  name VARCHAR(10),
  age INT,
  ts TIMESTAMP(3),
  `partition` VARCHAR(20)
) PARTITIONED BY (`partition`) WITH (
  'connector' = 'hudi',
  'table.type' = 'COPY_ON_WRITE',
  'hoodie.datasource.write.recordkey.field' = 'uuid',
  'precombine.field' = 'ts'
);

-- write twice with different uuids; no compaction is triggered since there are only 2 commits
INSERT INTO `hms_catalog`.`hudi_hms_db`.`flink_hudi_mor_tbl` VALUES
  ('id31','Danny',23,TIMESTAMP '1970-01-01 00:00:01','par1'),
  ('id32','Stephen',33,TIMESTAMP '1970-01-01 00:00:02','par1'),
  ('id33','Julian',53,TIMESTAMP '1970-01-01 00:00:03','par2'),
  ('id34','Fabian',31,TIMESTAMP '1970-01-01 00:00:04','par2'),
  ('id35','Sophia',18,TIMESTAMP '1970-01-01 00:00:05','par3'),
  ('id36','Emma',20,TIMESTAMP '1970-01-01 00:00:06','par3'),
  ('id37','Bob',44,TIMESTAMP '1970-01-01 00:00:07','par4'),
  ('id38','Han',56,TIMESTAMP '1970-01-01 00:00:08','par4');

-- write to another MOR table and trigger compactions
CREATE TABLE `default_catalog`.`default_database`.`fake_datasource` (
  `uuid` STRING,
  `name` STRING,
  `age` INT,
  `ts` AS PROCTIME(),
  `partition` VARCHAR(20)
) WITH (
  'connector' = 'faker',
  'rows-per-second' = '2',
  'fields.uuid.expression' = '#{numerify ''id####''}',
  'fields.name.expression' = '#{superhero.name}',
  'fields.age.expression' = '#{number.numberBetween ''20'',''50''}',
  'fields.partition.expression' = '#{Options.option ''par1'',''par2'',''par3'',''par4''}',
  'fields.ts.expression' = '#{date.past ''45'',''10'',''SECONDS''}'
);

INSERT INTO hms_catalog.hudi_hms_db.flink_hudi_mor_streaming_tbl
SELECT * FROM `default_catalog`.`default_database`.`fake_datasource`;

-- write three times with different uuids
INSERT INTO `hms_catalog`.`hudi_hms_db`.`flink_hudi_cow_tbl` VALUES
  ('id31','Danny',23,TIMESTAMP '1970-01-01 00:00:01','par1'),
  ('id32','Stephen',33,TIMESTAMP '1970-01-01 00:00:02','par1'),
  ('id33','Julian',53,TIMESTAMP '1970-01-01 00:00:03','par2'),
  ('id34','Fabian',31,TIMESTAMP '1970-01-01 00:00:04','par2'),
  ('id35','Sophia',18,TIMESTAMP '1970-01-01 00:00:05','par3'),
  ('id36','Emma',20,TIMESTAMP '1970-01-01 00:00:06','par3'),
  ('id37','Bob',44,TIMESTAMP '1970-01-01 00:00:07','par4'),
  ('id38','Han',56,TIMESTAMP '1970-01-01 00:00:08','par4');
```

2.
Use Spark to show the commits:

```
spark-sql> call show_commits(table => 'hudi_hms_db.flink_hudi_mor_tbl');
20230216160243458   4267      0  4  4  8     8   0
20230216160153391   4260      0  4  4  8     0   0
Time taken: 0.084 seconds, Fetched 2 row(s)

spark-sql> call show_commits(table => 'hudi_hms_db.flink_hudi_cow_tbl');
20230219145900116   1741955   0  4  4  24    0   0
20230216154007116   1741814   0  4  4  16    0   0
20230216154001168   1741175   4  0  4  8     0   0
Time taken: 0.44 seconds, Fetched 3 row(s)

spark-sql> call show_commits(table => 'hudi_hms_db.flink_hudi_mor_streaming_tbl');
20230216222718023   16248     0  4  4  60    60  0
20230216222648287   16137     0  4  4  60    60  0
20230216222617980   16256     0  4  4  60    60  0
20230216222548075   16363     0  4  4  60    60  0
20230216222548025   1917272   0  4  4  8163  71  0
20230216222518018   16266     0  4  4  60    60  0
20230216222448002   16384     0  4  4  60    60  0
20230216222418042   16273     0  4  4  60    60  0
20230216222347982   16307     0  4  4  60    60  0
20230216222318041   16275     0  4  4  60    60  0
Time taken: 0.533 seconds, Fetched 10 row(s)
```

3. Problem 1: querying all records returns an empty result, but the count aggregation works.
```
0: jdbc:hive2://xxxx-1:10000/> set hive.vectorized.execution.enabled=false;
0: jdbc:hive2://xxxx-1:10000/> select count(*) from flink_hudi_mor_tbl_rt;
+------+
| _c0  |
+------+
| 16   |
+------+
1 row selected (8.158 seconds)
0: jdbc:hive2://xxxx-1:10000/> select * from flink_hudi_mor_tbl_rt;
+--------------------------------------------+---------------------------------------------+-------------------------------------------+-----------------------------------------------+------------------------------------------+-----------------------------+-----------------------------+----------------------------+---------------------------+----------------------------------+
| flink_hudi_mor_tbl_rt._hoodie_commit_time  | flink_hudi_mor_tbl_rt._hoodie_commit_seqno  | flink_hudi_mor_tbl_rt._hoodie_record_key  | flink_hudi_mor_tbl_rt._hoodie_partition_path  | flink_hudi_mor_tbl_rt._hoodie_file_name  | flink_hudi_mor_tbl_rt.uuid  | flink_hudi_mor_tbl_rt.name  | flink_hudi_mor_tbl_rt.age  | flink_hudi_mor_tbl_rt.ts  | flink_hudi_mor_tbl_rt.partition  |
+--------------------------------------------+---------------------------------------------+-------------------------------------------+-----------------------------------------------+------------------------------------------+-----------------------------+-----------------------------+----------------------------+---------------------------+----------------------------------+
+--------------------------------------------+---------------------------------------------+-------------------------------------------+-----------------------------------------------+------------------------------------------+-----------------------------+-----------------------------+----------------------------+---------------------------+----------------------------------+
No rows selected (0.143 seconds)
```

**Expected behavior**

A clear and concise description of what you expected to happen.
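For reference, the incremental-query commands from the linked demo script, adapted to this table, would look roughly like the sketch below. The property-key prefix (assumed here to match the Hive table name `flink_hudi_mor_tbl_rt`, as the demo keys its properties by the queried table's name) and the start instant (taken from the `show_commits` output above) are assumptions for this setup, not verified against it:

```
-- Session settings used by the Hudi docker demo scripts
-- (assumption: they apply to this setup as well)
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
set hive.vectorized.execution.enabled=false;

-- Incremental consume properties, prefixed with the queried table's name;
-- the start instant is the older commit from show_commits above
set hoodie.flink_hudi_mor_tbl_rt.consume.mode=INCREMENTAL;
set hoodie.flink_hudi_mor_tbl_rt.consume.max.commits=3;
set hoodie.flink_hudi_mor_tbl_rt.consume.start.timestamp=20230216160153391;

select `_hoodie_commit_time`, uuid, name, age
from flink_hudi_mor_tbl_rt
where `_hoodie_commit_time` > '20230216160153391';
```

With these settings, only records from commits after the start timestamp should be returned, up to the configured maximum number of commits.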
**Environment Description**

* Hudi version : 0.12.2
* Spark version : 5.2.1
* Hive version : 3.1.2
* Hadoop version : 3.3.4
* Storage (HDFS/S3/GCS..) : HDFS
* Running on Docker? (yes/no) : No

**Additional context**

Add any other context about the problem here.

**Stacktrace**

```
Add the stacktrace of the error.
```

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
