vedantKhandelwalDP opened a new issue, #9478:
URL: https://github.com/apache/hudi/issues/9478

   We are using EMR version **emr-6.11.0**, Hudi version **0.13.0-amzn-0**, 
Spark version **3.3.2**, Hive version **3.1.3**.
   
   We recently migrated from Hudi 0.12.2 to 0.13.0. The main reason for the 
migration was that archival for the metadata table was not triggering, leaving 
too many files in the /metadata/.hoodie/ folder. The cause was 
**https://github.com/apache/hudi/issues/7472**, which, as stated there, was 
fixed in Hudi 0.13.0.
   After migrating to Hudi 0.13.0, we observed that archival is not working for 
either the /.hoodie folder or /.hoodie/metadata/.hoodie/.
   The table type is COW.
   
   Logs:
   For our COW table, the log says:
   [2023-08-16 06:26:33,848] INFO No Instants to archive 
(org.apache.hudi.client.HoodieTimelineArchiver)
   
   For the corresponding metadata table, the log says:
   [2023-08-16 06:40:14,400] INFO Not archiving as there is no compaction yet 
on the metadata table (org.apache.hudi.client.HoodieTimelineArchiver)
   [2023-08-16 06:40:14,400] INFO No Instants to archive 
(org.apache.hudi.client.HoodieTimelineArchiver)
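   For context, Hudi's timeline archiver only kicks in once the active timeline grows past `hoodie.keep.max.commits`, and it then trims the oldest instants so that roughly `hoodie.keep.min.commits` remain; on the metadata table it additionally waits for a compaction, which is what the second log line reports. A rough sketch of the trigger condition (an illustrative simplification with a hypothetical helper name, not the actual HoodieTimelineArchiver logic):

   ```python
   def instants_to_archive(active_instants, min_commits, max_commits):
       """Hypothetical simplification of the archival trigger: archive only
       when the active timeline exceeds keep.max.commits, keeping the newest
       keep.min.commits instants active."""
       if len(active_instants) <= max_commits:
           return []  # corresponds to the "No Instants to archive" log line
       return active_instants[: len(active_instants) - min_commits]

   # With the settings from this issue (keep.min.commits=3, keep.max.commits=4),
   # four or fewer active instants would legitimately produce "No Instants to
   # archive"; six active instants should archive the oldest three.
   print(instants_to_archive(["c1", "c2", "c3", "c4"], 3, 4))
   print(instants_to_archive(["c1", "c2", "c3", "c4", "c5", "c6"], 3, 4))
   ```

   So if the table keeps accumulating instants well beyond `keep.max.commits` and still logs "No Instants to archive", something else is blocking archival (e.g. the metadata-table compaction precondition above).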
   
   
   **Following is the complete list of Hudi params (copied from the `DEBUG Passed in properties:` log output):**
   hive_sync.support_timestamp=true
   hoodie.archive.async=true
   hoodie.archive.automatic=true
   hoodie.archivelog.folder=archived
   hoodie.bulkinsert.shuffle.parallelism=200
   hoodie.clean.async=true
   hoodie.clean.automatic=true
   hoodie.cleaner.commits.retained=2
   hoodie.cleaner.policy.failed.writes=EAGER
   hoodie.clustering.async.enabled=false
   hoodie.clustering.inline=false
   hoodie.datasource.compaction.async.enable=true
   hoodie.datasource.hive_sync.base_file_format=PARQUET
   hoodie.datasource.hive_sync.create_managed_table=false
   hoodie.datasource.hive_sync.database=<db_name>
   hoodie.datasource.hive_sync.enable=true
   hoodie.datasource.hive_sync.jdbcurl=<hive jdbc url>
   
hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor
   hoodie.datasource.hive_sync.partition_fields=dt
   hoodie.datasource.hive_sync.password=hive
   hoodie.datasource.hive_sync.schema_string_length_thresh=4000
   hoodie.datasource.hive_sync.support_timestamp=true
   hoodie.datasource.hive_sync.sync_as_datasource=true
   hoodie.datasource.hive_sync.table=table_name
   hoodie.datasource.hive_sync.use_jdbc=true
   hoodie.datasource.hive_sync.username=hive
   hoodie.datasource.meta.sync.base.path=<s3 path>
   hoodie.datasource.meta.sync.enable=true
   hoodie.datasource.write.commitmeta.key.prefix=_
   hoodie.datasource.write.drop.partition.columns=false
   hoodie.datasource.write.hive_style_partitioning=true
   hoodie.datasource.write.insert.drop.duplicates=false
   
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator
   
hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled=false
   hoodie.datasource.write.operation=upsert
   hoodie.datasource.write.partitionpath.field=dt
   hoodie.datasource.write.partitionpath.urlencode=false
   
hoodie.datasource.write.payload.class=org.apache.hudi.common.model.OverwriteWithLatestAvroPayload
   hoodie.datasource.write.precombine.field=ingestedat
   hoodie.datasource.write.reconcile.schema=false
   hoodie.datasource.write.recordkey.field=id
   hoodie.datasource.write.row.writer.enable=true
   hoodie.datasource.write.streaming.ignore.failed.batch=false
   hoodie.datasource.write.streaming.retry.count=3
   
hoodie.datasource.write.streaming.retry.interval.ms=2000
   hoodie.datasource.write.table.type=COPY_ON_WRITE
   hoodie.fail.on.timeline.archiving=false
   hoodie.finalize.write.parallelism=200
   hoodie.insert.shuffle.parallelism=200
   hoodie.keep.max.commits=4
   hoodie.keep.min.commits=3
   hoodie.meta.sync.client.tool.class=org.apache.hudi.hive.HiveSyncTool
   hoodie.meta.sync.metadata_file_listing=true
   hoodie.meta_sync.spark.version=3.3.2-amzn-0
   hoodie.metadata.clean.async=true
   hoodie.metadata.cleaner.commits.retained=4
   hoodie.metadata.enable=true
   hoodie.metadata.keep.max.commits=7
   hoodie.metadata.keep.min.commits=5
   hoodie.metrics.pushgateway.host=<push gateway url>
   hoodie.metrics.pushgateway.port=9091
   hoodie.parquet.max.file.size=128000000
   hoodie.parquet.small.file.limit=100000000
   hoodie.payload.ordering.field=ingestedat
   hoodie.table.base.file.format=PARQUET
   hoodie.table.checksum=1229177767
   hoodie.table.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator
   hoodie.table.metadata.partitions=files
   hoodie.table.name=table_name
   hoodie.table.partition.fields=dt
   hoodie.table.precombine.field=ingestedat
   hoodie.table.recordkey.fields=id
   hoodie.table.type=COPY_ON_WRITE
   hoodie.table.version=5
   hoodie.timeline.layout.version=1
   hoodie.upsert.shuffle.parallelism=200
   hoodie.write.concurrency.mode=single_writer
   
hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider
   hoodie.write.lock.zookeeper.base_path=/hudi
   hoodie.write.lock.zookeeper.port=<zk port>
   hoodie.write.lock.zookeeper.url=<zk url>
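   
   As a sanity check on the archival-related values above: Hudi expects `hoodie.cleaner.commits.retained` to be smaller than `hoodie.keep.min.commits`, which in turn must not exceed `hoodie.keep.max.commits` (the archiver must not archive instants the cleaner still needs). A minimal check of the values from this report, assuming that constraint holds:

   ```python
   # Sanity check (illustrative): verify the archival/cleaning settings from
   # this issue satisfy cleaner.commits.retained < keep.min.commits <= keep.max.commits.
   config = {
       "hoodie.cleaner.commits.retained": 2,
       "hoodie.keep.min.commits": 3,
       "hoodie.keep.max.commits": 4,
   }

   retained = config["hoodie.cleaner.commits.retained"]
   min_commits = config["hoodie.keep.min.commits"]
   max_commits = config["hoodie.keep.max.commits"]

   consistent = retained < min_commits <= max_commits
   print("archival settings consistent:", consistent)
   ```

   With `retained=2`, `min=3`, `max=4` the constraint is satisfied, so a misconfiguration of these three values does not appear to explain the missing archival here.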
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.