maheshguptags opened a new issue, #7589:
URL: https://github.com/apache/hudi/issues/7589

   
   **Want to clean only the Avro and Parquet files, not the clustered files (previous versions of clustered files)**
   
   Hi Team,
   I want to perform clustering together with cleaning, which works fine, but my use case is a bit different: I want to clean all the Avro + Parquet files (generated by compaction) but keep all the clustered files, to maintain the historical data of users.
   
   Steps to reproduce the behavior:
   
   **Hudi Configuration**
   ```python
   hudi_options_write = {
       'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
       'hoodie.datasource.write.recordkey.field': 'xxxx,yyyy',
       'hoodie.table.name': tableName,
       'hoodie.datasource.write.hive_style_partitioning': 'true',
       'hoodie.archivelog.folder': 'archived',
       'hoodie.datasource.write.partitionpath.field': 'xxxxx',
       'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',
       'hoodie.datasource.write.partitionpath.urlencode': 'false',
       'hoodie.upsert.shuffle.parallelism': 2,
       'hoodie.datasource.write.precombine.field': 'updated_date',
       'hoodie.compact.inline.max.delta.commits': 4,
       'hoodie.clean.automatic': 'false',
       'hoodie.compact.inline': 'true',
       'hoodie.parquet.small.file.limit': '0',
       'hoodie.clustering.inline': 'true',
       'hoodie.clustering.inline.max.commits': '4',
       'hoodie.clustering.plan.strategy.target.file.max.bytes': '1073741824',
       # Files smaller than the size in bytes specified here are candidates for clustering
       'hoodie.clustering.plan.strategy.small.file.limit': '629145600',
       'hoodie.clustering.plan.strategy.sort.columns': 'xxxxx'
   }
   ```
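   For context, the write itself is a standard PySpark Hudi upsert. The sketch below is a minimal illustration of how the options above would be applied; `df`, the table name, and the S3 path are placeholders (the actual names in my job differ), and an existing `spark` session is assumed.
   
   ```python
   # Minimal sketch of a MERGE_ON_READ upsert using the options above.
   # 'users_table' and the target path are placeholders, not the real values.
   hudi_options_write = {
       'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
       'hoodie.table.name': 'users_table',  # stands in for tableName
       'hoodie.datasource.write.recordkey.field': 'xxxx,yyyy',
       'hoodie.datasource.write.precombine.field': 'updated_date',
       'hoodie.compact.inline': 'true',     # inline compaction of log files
       'hoodie.clustering.inline': 'true',  # inline clustering after commits
   }
   
   # With a live SparkSession and DataFrame `df`, the write would be:
   # df.write.format("hudi") \
   #     .options(**hudi_options_write) \
   #     .mode("append") \
   #     .save("s3://<bucket>/<path>/users_table")
   ```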
   
   
   **Expected behavior**
   
   After every cleaning run, only the Avro and compacted Parquet files would be cleaned, not the clustered files.
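   For comparison, my understanding is that Hudi's cleaner is normally driven by a retention policy rather than by how a file was produced. The sketch below shows the standard cleaner settings (these key names come from Hudi's cleaning configuration); whether they can distinguish compacted files from clustered ones is exactly the open question here.
   
   ```python
   # Standard Hudi cleaner settings (retention-based, not origin-based).
   # These control HOW MANY old file slices survive, not WHICH KIND of
   # file (compaction output vs. clustering output) gets cleaned.
   cleaner_options = {
       'hoodie.clean.automatic': 'true',
       # Keep the file slices needed to serve the last N commits.
       'hoodie.cleaner.policy': 'KEEP_LATEST_COMMITS',
       'hoodie.cleaner.commits.retained': '10',
   }
   ```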
   
   **Environment Description**
   
   * Hudi version : 11.01
   
   * Spark version : Spark 3.3.0
   
   * Hive version : Hive 3.1.3
   
   * Hadoop version : Hadoop 3.2.1
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : No
   
   **Stacktrace**
   
    I want to explore keeping the clustered Parquet files as a way to store historical data. If the existing Hudi system supports this, can someone from the Hudi team help me achieve it? If not, is there any workaround?
    
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
