stayrascal opened a new issue, #8038: URL: https://github.com/apache/hudi/issues/8038
**Describe the problem you faced**

Hi, I'm using Flink 1.16 to create some tables (COW & MOR) based on HoodieCatalog, write data to them, and then query the data from other engines. I'm confused about how to run incremental queries from Hive; I couldn't find any official documentation about this apart from the demo at https://github.com/apache/hudi/blob/master/docker/demo/hive-incremental-mor-rt.commands. I hit two problems:

- For a MOR table that has not had any compaction, a count aggregation on the RO table returns the expected result, but querying all fields returns an empty result.
- For a MOR table that has had some compactions, incremental queries do not seem to work.

**To Reproduce**

Steps to reproduce the behavior:

1. Flink: create the Hoodie catalog and tables, then insert data.

```
CREATE CATALOG hms_catalog WITH (
  'type' = 'hudi',
  'catalog.path' = 'hdfs://xxxxx-1:8020/xxxx/hive',
  'hive.conf.dir' = '/xxxxx/hive/conf/',
  'mode' = 'hms'
);

CREATE TABLE flink_hudi_mor_tbl (
  uuid VARCHAR(20) PRIMARY KEY NOT ENFORCED,
  name VARCHAR(10),
  age INT,
  ts TIMESTAMP(3),
  `partition` VARCHAR(20)
) PARTITIONED BY (`partition`) WITH (
  'connector' = 'hudi',
  'table.type' = 'MERGE_ON_READ',
  'hoodie.datasource.write.recordkey.field' = 'uuid',
  'precombine.field' = 'ts',
  'hive_sync.enabled' = 'true'
);

CREATE TABLE hms_catalog.hudi_hms_db.flink_hudi_mor_streaming_tbl (
  uuid VARCHAR(20) PRIMARY KEY NOT ENFORCED,
  name VARCHAR(10),
  age INT,
  ts TIMESTAMP(3),
  `partition` VARCHAR(20)
) PARTITIONED BY (`partition`) WITH (
  'connector' = 'hudi',
  'table.type' = 'MERGE_ON_READ',
  'hoodie.datasource.write.recordkey.field' = 'uuid',
  'precombine.field' = 'ts',
  'hive_sync.enabled' = 'true'
);

CREATE TABLE flink_hudi_cow_tbl (
  uuid VARCHAR(20) PRIMARY KEY NOT ENFORCED,
  name VARCHAR(10),
  age INT,
  ts TIMESTAMP(3),
  `partition` VARCHAR(20)
) PARTITIONED BY (`partition`) WITH (
  'connector' = 'hudi',
  'table.type' = 'COPY_ON_WRITE',
  'hoodie.datasource.write.recordkey.field' = 'uuid',
  'precombine.field' = 'ts'
);

-- write twice with different uuids; no compaction is triggered since there are only 2 commits
INSERT INTO `hms_catalog`.`hudi_hms_db`.`flink_hudi_mor_tbl` VALUES
  ('id31','Danny',23,TIMESTAMP '1970-01-01 00:00:01','par1'),
  ('id32','Stephen',33,TIMESTAMP '1970-01-01 00:00:02','par1'),
  ('id33','Julian',53,TIMESTAMP '1970-01-01 00:00:03','par2'),
  ('id34','Fabian',31,TIMESTAMP '1970-01-01 00:00:04','par2'),
  ('id35','Sophia',18,TIMESTAMP '1970-01-01 00:00:05','par3'),
  ('id36','Emma',20,TIMESTAMP '1970-01-01 00:00:06','par3'),
  ('id37','Bob',44,TIMESTAMP '1970-01-01 00:00:07','par4'),
  ('id38','Han',56,TIMESTAMP '1970-01-01 00:00:08','par4');

-- write to another MOR table and trigger compactions
CREATE TABLE `default_catalog`.`default_database`.`fake_datasource` (
  `uuid` STRING,
  `name` STRING,
  `age` INT,
  `ts` AS PROCTIME(),
  `partition` VARCHAR(20)
) WITH (
  'connector' = 'faker',
  'rows-per-second' = '2',
  'fields.uuid.expression' = '#{numerify ''id####''}',
  'fields.name.expression' = '#{superhero.name}',
  'fields.age.expression' = '#{number.numberBetween ''20'',''50''}',
  'fields.partition.expression' = '#{Options.option ''par1'',''par2'',''par3'',''par4''}',
  'fields.ts.expression' = '#{date.past ''45'',''10'',''SECONDS''}'
);

INSERT INTO hms_catalog.hudi_hms_db.flink_hudi_mor_streaming_tbl
SELECT * FROM `default_catalog`.`default_database`.`fake_datasource`;

-- write three times with different uuids
INSERT INTO `hms_catalog`.`hudi_hms_db`.`flink_hudi_cow_tbl` VALUES
  ('id31','Danny',23,TIMESTAMP '1970-01-01 00:00:01','par1'),
  ('id32','Stephen',33,TIMESTAMP '1970-01-01 00:00:02','par1'),
  ('id33','Julian',53,TIMESTAMP '1970-01-01 00:00:03','par2'),
  ('id34','Fabian',31,TIMESTAMP '1970-01-01 00:00:04','par2'),
  ('id35','Sophia',18,TIMESTAMP '1970-01-01 00:00:05','par3'),
  ('id36','Emma',20,TIMESTAMP '1970-01-01 00:00:06','par3'),
  ('id37','Bob',44,TIMESTAMP '1970-01-01 00:00:07','par4'),
  ('id38','Han',56,TIMESTAMP '1970-01-01 00:00:08','par4');
```

2.
Use Spark to show the commits:

```
spark-sql> call show_commits(table => 'hudi_hms_db.flink_hudi_mor_tbl');
20230216160243458   4267      0  4  4  8     8   0
20230216160153391   4260      0  4  4  8     0   0
Time taken: 0.084 seconds, Fetched 2 row(s)

spark-sql> call show_commits(table => 'hudi_hms_db.flink_hudi_cow_tbl');
20230219145900116   1741955   0  4  4  24    0   0
20230216154007116   1741814   0  4  4  16    0   0
20230216154001168   1741175   4  0  4  8     0   0
Time taken: 0.44 seconds, Fetched 3 row(s)

spark-sql> call show_commits(table => 'hudi_hms_db.flink_hudi_mor_streaming_tbl');
20230216222718023   16248     0  4  4  60    60  0
20230216222648287   16137     0  4  4  60    60  0
20230216222617980   16256     0  4  4  60    60  0
20230216222548075   16363     0  4  4  60    60  0
20230216222548025   1917272   0  4  4  8163  71  0
20230216222518018   16266     0  4  4  60    60  0
20230216222448002   16384     0  4  4  60    60  0
20230216222418042   16273     0  4  4  60    60  0
20230216222347982   16307     0  4  4  60    60  0
20230216222318041   16275     0  4  4  60    60  0
Time taken: 0.533 seconds, Fetched 10 row(s)
```

3. Problem 1: querying all records returns an empty result, but the count aggregation works.
```
0: jdbc:hive2://xxxx-1:10000/> set hive.vectorized.execution.enabled=false;
0: jdbc:hive2://xxxx-1:10000/> select count(*) from flink_hudi_mor_tbl_rt;
+------+
| _c0  |
+------+
| 16   |
+------+
1 row selected (8.158 seconds)
0: jdbc:hive2://xxxx-1:10000/> select * from flink_hudi_mor_tbl_rt;
+--------------------------------------------+---------------------------------------------+-------------------------------------------+-----------------------------------------------+------------------------------------------+-----------------------------+-----------------------------+----------------------------+---------------------------+----------------------------------+
| flink_hudi_mor_tbl_rt._hoodie_commit_time  | flink_hudi_mor_tbl_rt._hoodie_commit_seqno  | flink_hudi_mor_tbl_rt._hoodie_record_key  | flink_hudi_mor_tbl_rt._hoodie_partition_path  | flink_hudi_mor_tbl_rt._hoodie_file_name  | flink_hudi_mor_tbl_rt.uuid  | flink_hudi_mor_tbl_rt.name  | flink_hudi_mor_tbl_rt.age  | flink_hudi_mor_tbl_rt.ts  | flink_hudi_mor_tbl_rt.partition  |
+--------------------------------------------+---------------------------------------------+-------------------------------------------+-----------------------------------------------+------------------------------------------+-----------------------------+-----------------------------+----------------------------+---------------------------+----------------------------------+
+--------------------------------------------+---------------------------------------------+-------------------------------------------+-----------------------------------------------+------------------------------------------+-----------------------------+-----------------------------+----------------------------+---------------------------+----------------------------------+
No rows selected (0.143 seconds)
```

**Expected behavior**

A clear and concise description of what you expected to happen.
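For reference, the incremental-query commands from the linked demo script, adapted to this table, would look roughly like the sketch below. The property-key prefix (assumed here to match the Hive table name `flink_hudi_mor_tbl_rt`, as the demo keys its properties by the queried table's name) and the start instant (taken from the `show_commits` output above) are assumptions for this setup, not verified against it:

```
-- Session settings used by the Hudi docker demo scripts
-- (assumption: they apply to this setup as well)
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
set hive.vectorized.execution.enabled=false;

-- Incremental consume properties, prefixed with the queried table's name;
-- the start instant is the older commit from show_commits above
set hoodie.flink_hudi_mor_tbl_rt.consume.mode=INCREMENTAL;
set hoodie.flink_hudi_mor_tbl_rt.consume.max.commits=3;
set hoodie.flink_hudi_mor_tbl_rt.consume.start.timestamp=20230216160153391;

select `_hoodie_commit_time`, uuid, name, age
from flink_hudi_mor_tbl_rt
where `_hoodie_commit_time` > '20230216160153391';
```

With these settings, only records from commits after the start timestamp should be returned, up to the configured maximum number of commits.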
**Environment Description**

* Hudi version : 0.12.2
* Spark version : 5.2.1
* Hive version : 3.1.2
* Hadoop version : 3.3.4
* Storage (HDFS/S3/GCS..) : HDFS
* Running on Docker? (yes/no) : No

**Additional context**

Add any other context about the problem here.

**Stacktrace**

```
Add the stacktrace of the error.
```

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
