maheshguptags opened a new issue, #7589:
URL: https://github.com/apache/hudi/issues/7589
**Want to clean only the Avro and Parquet files, not the clustered files (previous versions of clustered files)**
Hi Team,
I want to perform clustering together with cleaning, which is working fine, but my use case is a bit different: I want to clean all the Avro and Parquet files generated by compaction while keeping all the clustered files, in order to preserve the historical data of users.
Steps to reproduce the behavior:
**Hudi configuration**
```python
hudi_options_write = {
    'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
    'hoodie.datasource.write.recordkey.field': 'xxxx,yyyy',
    'hoodie.table.name': tableName,
    'hoodie.datasource.write.hive_style_partitioning': 'true',
    'hoodie.archivelog.folder': 'archived',
    'hoodie.datasource.write.partitionpath.field': 'xxxxx',
    'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',
    'hoodie.datasource.write.partitionpath.urlencode': 'false',
    'hoodie.upsert.shuffle.parallelism': 2,
    'hoodie.datasource.write.precombine.field': 'updated_date',
    'hoodie.compact.inline.max.delta.commits': 4,
    'hoodie.clean.automatic': 'false',
    'hoodie.compact.inline': 'true',
    'hoodie.parquet.small.file.limit': '0',
    'hoodie.clustering.inline': 'true',
    'hoodie.clustering.inline.max.commits': '4',
    'hoodie.clustering.plan.strategy.target.file.max.bytes': '1073741824',
    # Files smaller than the size in bytes specified here are candidates for clustering
    'hoodie.clustering.plan.strategy.small.file.limit': '629145600',
    'hoodie.clustering.plan.strategy.sort.columns': 'xxxxx'
}
```
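For reference, here is a minimal sketch of the cleaner-related options I have looked at so far. As far as I understand, the cleaner retains or removes file versions by commit/version count regardless of whether a file slice was produced by compaction or by clustering, so these options alone do not distinguish the two; the key names are from the Hudi configuration docs, and the retention value is just an illustrative placeholder:

```python
# Hedged sketch: standard cleaner retention options (verify names/values
# against the Hudi version in use). These control how many commits' worth
# of file versions survive cleaning, but do NOT exempt clustered files.
cleaner_options = {
    'hoodie.clean.automatic': 'true',
    'hoodie.cleaner.policy': 'KEEP_LATEST_COMMITS',
    'hoodie.cleaner.commits.retained': '10',  # placeholder retention count
}
```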
**Expected behavior**
After every cleaning run, only the Avro and (compacted) Parquet files are cleaned, not the clustered files.
**Environment Description**
* Hudi version : 11.01
* Spark version : Spark 3.3.0
* Hive version : Hive 3.1.3
* Hadoop version : Hadoop 3.2.1
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : No
**Stacktrace**
N/A. I want to explore retaining historical data by keeping the clustered Parquet files. If this is supported by the existing Hudi system, can someone from the Hudi team help me achieve it? If not, is there any workaround?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]