qingyuan18 opened a new issue #5163:
URL: https://github.com/apache/hudi/issues/5163
**Describe the problem you faced**

- Hudi does not support compaction when Flink SQL is used to insert bounded data directly into a MOR table.
- When using the Flink SQL Hudi connector to insert bounded data into a MOR table, Hudi does not compact the avro log files into parquet, neither via the Hudi CLI nor via the Flink compaction utility.
- This affects Trino/PrestoDB queries against the MOR read-optimized (RO) table, since they cannot retrieve any results while no parquet file has been generated.
- Online/offline compaction only works if Spark is used to initialize the parquet files for the MOR table, or if Flink streaming is used to continuously ingest into the MOR table.
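One way to confirm that compaction never completed is to inspect the table's timeline under `.hoodie`: writes to a MOR table complete as `*.deltacommit` instants, while a finished compaction completes as a `*.commit` instant, so the RO view stays empty as long as no `*.commit` exists. A minimal sketch, assuming the timeline directory has been copied locally (e.g. with `aws s3 cp --recursive`); the helper name is hypothetical, not a Hudi API:

```python
import os

def timeline_summary(hoodie_dir: str) -> dict:
    """Count completed deltacommit vs commit (compaction) instants in a
    local copy of a Hudi table's .hoodie timeline directory."""
    summary = {"deltacommit": 0, "commit": 0}
    for name in os.listdir(hoodie_dir):
        if name.endswith(".deltacommit"):
            # a completed log (avro) write on a MOR table
            summary["deltacommit"] += 1
        elif name.endswith(".commit"):
            # a completed compaction into parquet base files
            summary["commit"] += 1
    return summary
```

If the summary shows delta commits but zero commits, the log files were written but never compacted into parquet, which matches the behavior described above.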
**To Reproduce**

Steps to reproduce the behavior:

1. Set up the Flink/Hudi integration environment and libraries (hudi-flink-bundle_2.12-0.10.1.jar, Flink 1.13.1, Hudi 0.10.1).
2. Start Flink in YARN session mode:

```shell
flink-yarn-session -jm 1024 -tm 4096 -s 2 \
  -D state.backend=rocksdb \
  -D state.checkpoint-storage=filesystem \
  -D state.checkpoints.dir=${checkpoints} \
  -D execution.checkpointing.interval=60000 \
  -D state.checkpoints.num-retained=5 \
  -D execution.checkpointing.mode=EXACTLY_ONCE \
  -D execution.checkpointing.externalized-checkpoint-retention=RETAIN_ON_CANCELLATION \
  -D state.backend.incremental=true \
  -D execution.checkpointing.max-concurrent-checkpoints=1 \
  -D rest.flamegraph.enabled=true \
  -d \
  -t /etc/hive/conf/hive-site.xml
```
3. Start the Flink SQL client and create the Flink Hudi MOR table:

```shell
/usr/lib/flink/bin/sql-client.sh -s application_1648449519844_0018
```

```sql
CREATE TABLE t4(
  uuid VARCHAR(20),
  name VARCHAR(10),
  age INT,
  ts TIMESTAMP(3),
  `partition` VARCHAR(20)
)
PARTITIONED BY (`partition`)
WITH (
  'connector' = 'hudi',
  'path' = 's3://emrfssampledata/flink-hudi/t4',
  'table.type' = 'MERGE_ON_READ',  -- with MERGE_ON_READ, hive queries have no output until a parquet file is generated
  'compaction.tasks' = '1',
  'compaction.async.enabled' = 'true',
  'compaction.trigger.strategy' = 'num_or_time',
  'compaction.delta_commits' = '1',
  'compaction.delta_seconds' = '60',
  'hive_sync.enable' = 'true',  -- required, to enable hive synchronization
  'hive_sync.mode' = 'HMS',
  'hive_sync.use_jdbc' = 'false',
  'hive_sync.username' = 'hadoop',
  'hive_sync.db' = 'flinkstreamdb',
  'hive_sync.table' = 't4',
  'hive_sync.partition_fields' = 'partition',
  'hive_sync.partition_extractor_class' = 'org.apache.hudi.hive.MultiPartKeysValueExtractor'
);
```
4. Use Flink SQL to insert directly into the MOR table:

```sql
INSERT INTO t4 VALUES
  ('id1','Julian',53,TIMESTAMP '1970-01-01 00:00:03','par2'),
  ('id103','Fabian',31,TIMESTAMP '1970-01-01 00:00:04','par2'),
  ('id105','Sophia',18,TIMESTAMP '1970-01-01 00:00:05','par3'),
  ('id406','Emma',20,TIMESTAMP '1970-01-01 00:00:06','par3');
```
5. Use the Flink compaction utility or the Hudi CLI to compact the avro logs into parquet:

```shell
./bin/flink run -c org.apache.hudi.sink.compact.HoodieFlinkCompactor \
  lib/hudi-flink-bundle_2.12-0.10.1.jar --path s3://emrfssampledata/flink-hudi/t4
```

```shell
./hudi-cli/hudi-cli.sh
```

```
connect --path s3://emrfssampledata/flink-hudi/t4
compaction run --parallelism 100 --sparkMemory 1g --retry 1 \
  --compactionInstant 20210602101315 \
  --hoodieConfigs 'hoodie.compaction.strategy=org.apache.hudi.table.action.compact.strategy.BoundedIOCompactionStrategy,hoodie.compaction.target.io21,hoodie.compact.inline.max.delta.commits=1' \
  --propsFilePath s3://emrfssampledata/flink-hudi/t4/.hoodie/hoodie.properties \
  --schemaFilePath s3://emrfssampledata/flink-hudi/t4/.hoodie/t4.json
```
6. Use Trino/PrestoDB to query the Hudi MOR RO view:

```shell
presto-cli --catalog=hive
```

```sql
select * from t4_ro;
```
**Expected behavior**

Querying the Hudi MOR RO view should return the inserted rows, but no results are retrieved.
**Environment Description**
* Hudi version : 0.10.1
* Spark version : 3.1.2
* Flink version : 1.13.1
* Hive version : 3.2
* Hadoop version : 3.1
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no
**Stacktrace**

No error is thrown, but the compaction operation returns without generating any parquet files.
**Expected behavior**

The avro log files should be compacted into parquet according to the Flink Hudi table's compaction properties (delta commits, trigger strategy, delta seconds, etc.).
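One thing that may be worth trying: the offline `HoodieFlinkCompactor` executes compaction plans that already exist on the timeline, and with a bounded insert the writing job exits before the async compaction operator has scheduled a plan, so the compactor can find nothing to execute. The 0.10.x Flink compaction config exposes a `--schedule` flag that lets the compactor schedule a plan itself before executing it (flag name per my reading of FlinkCompactionConfig; please verify against your bundle). A hedged sketch:

```shell
# Assumption: --schedule asks the offline compactor to generate a new
# compaction plan before executing it, covering the case where the
# bounded insert job exited before any plan was scheduled.
./bin/flink run -c org.apache.hudi.sink.compact.HoodieFlinkCompactor \
  lib/hudi-flink-bundle_2.12-0.10.1.jar \
  --path s3://emrfssampledata/flink-hudi/t4 \
  --schedule
```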