noahtaite commented on issue #10172:
URL: https://github.com/apache/hudi/issues/10172#issuecomment-2115213114
Apologies again for the delay. We shelved clustering after this experiment
and re-generated our lake with proper file sizing.
Since this issue could affect others, I'll share the configs that got us
there:
Upsert config:
```json
{
"hoodie.datasource.hive_sync.database": "db",
"hoodie.global.simple.index.parallelism": "1920",
"hoodie.datasource.hive_sync.mode": "hms",
"hoodie.datasource.hive_sync.support_timestamp": "true",
"hoodie.schema.on.read.enable": "false",
"path": "s3://bucket/table.all_hudi",
"hoodie.datasource.write.precombine.field": "CaptureDate",
"hoodie.datasource.hive_sync.partition_fields": "datasource,year,month",
"hoodie.datasource.write.payload.class":
"org.apache.hudi.common.model.OverwriteWithLatestAvroPayload",
"hoodie.datasource.hive_sync.use_jdbc": "false",
"hoodie.meta.sync.metadata_file_listing": "true",
"hoodie.cleaner.parallelism": "1920",
"hoodie.datasource.meta.sync.enable": "true",
"hoodie.datasource.hive_sync.skip_ro_suffix": "true",
"hoodie.metadata.enable": "true",
"hoodie.datasource.hive_sync.table": "table_all",
"hoodie.datasource.meta_sync.condition.sync": "true",
"hoodie.index.type": "GLOBAL_BLOOM",
"hoodie.clean.automatic": "true",
"hoodie.datasource.write.operation": "upsert",
"hoodie.datasource.hive_sync.enable": "true",
"hoodie.datasource.write.recordkey.field": "uuid",
"hoodie.table.name": "table_all",
"hoodie.write.lock.dynamodb.billing_mode": "PAY_PER_REQUEST",
"hoodie.datasource.write.table.type": "MERGE_ON_READ",
"hoodie.datasource.write.hive_style_partitioning": "true",
"hoodie.write.lock.dynamodb.endpoint_url": "*********(redacted)",
"hoodie.simple.index.parallelism": "1920",
"hoodie.write.lock.dynamodb.partition_key": "table_all",
"hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",
"hoodie.write.concurrency.early.conflict.detection.enable": "true",
"hoodie.compact.inline": "true",
"hoodie.datasource.write.reconcile.schema": "true",
"hoodie.datasource.write.keygenerator.class":
"org.apache.hudi.keygen.ComplexKeyGenerator",
"hoodie.cleaner.policy.failed.writes": "LAZY",
"hoodie.keep.max.commits": "110",
"hoodie.upsert.shuffle.parallelism": "1920",
"hoodie.meta.sync.client.tool.class":
"org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool",
"hoodie.cleaner.commits.retained": "90",
"hoodie.write.lock.dynamodb.table": "hudi-lock-provider",
"hoodie.write.lock.provider":
"org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider",
"hoodie.keep.min.commits": "100",
"hoodie.datasource.write.partitionpath.field": "datasource,year,month",
"hoodie.write.concurrency.mode": "OPTIMISTIC_CONCURRENCY_CONTROL",
"hoodie.write.lock.dynamodb.region": "us-east-1"
}
```
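For context, an options map like the one above gets handed to the Hudi Spark datasource writer. The sketch below is a hedged illustration, not our actual job: the `df` argument, the session setup, and the trimmed-down option subset are placeholders.

```python
# Minimal sketch: feeding the upsert options above into the Hudi
# Spark datasource writer. Only a representative subset of keys is shown.
hudi_options = {
    "hoodie.table.name": "table_all",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.recordkey.field": "uuid",
    "hoodie.datasource.write.precombine.field": "CaptureDate",
    "hoodie.datasource.write.partitionpath.field": "datasource,year,month",
    "hoodie.index.type": "GLOBAL_BLOOM",
}

def upsert(df, path="s3://bucket/table.all_hudi"):
    """Apply the options map to a DataFrame write.

    Requires a live SparkSession with the Hudi bundle on the classpath;
    shown here only to make the wiring concrete.
    """
    (df.write.format("hudi")
        .options(**hudi_options)
        .mode("append")
        .save(path))
```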
Clustering properties:
```properties
hoodie.clustering.async.enabled=true
hoodie.clustering.async.max.commits=1
hoodie.clustering.execution.strategy.class=org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy
hoodie.clustering.plan.strategy.target.file.max.bytes=524288000
hoodie.clustering.plan.strategy.small.file.limit=10485760
hoodie.clustering.preserve.commit.metadata=true
```
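In case the raw byte values above look opaque: the clustering target file size is 500 MiB and the small-file limit is 10 MiB, which a quick check confirms:

```python
MiB = 1024 * 1024

# Values from the clustering properties above, expressed in MiB.
target_file_max_bytes = 500 * MiB  # hoodie.clustering.plan.strategy.target.file.max.bytes
small_file_limit = 10 * MiB        # hoodie.clustering.plan.strategy.small.file.limit

assert target_file_max_bytes == 524288000
assert small_file_limit == 10485760
```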
Clustering job:
```sh
spark-submit \
  --class org.apache.hudi.utilities.HoodieClusteringJob \
  /usr/lib/hudi/hudi-utilities-bundle.jar \
  --props s3://bucket/properties/nt.clustering.properties \
  --mode scheduleAndExecute \
  --base-path s3://bucket/table.all_hudi/ \
  --table-name table_all \
  --spark-memory 90g \
  --parallelism 1000
```
From what I could gather, it appears that applying soft deletes moves
records to `__HIVE_DEFAULT_PARTITION__` (since nulling the record's fields
also nulls its partition-path fields), and with a global index the old
version of those records can still be visible in a snapshot query until
compaction is run. I observed this on Hudi 0.12.1 (AWS EMR 6.9.0). I don't
currently have the bandwidth to reproduce it on our latest stable Hudi
0.13.1 (AWS EMR 6.12.0) job.
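To make the `__HIVE_DEFAULT_PARTITION__` behavior concrete, here is a minimal sketch of how a hive-style partition path resolves when a soft delete nulls out the partition fields. `make_partition_path` is a hypothetical helper mimicking the observed behavior, not Hudi's actual key generator:

```python
HIVE_DEFAULT_PARTITION = "__HIVE_DEFAULT_PARTITION__"

def make_partition_path(record, fields):
    """Build a hive-style partition path; null partition fields fall
    back to the Hive default-partition sentinel."""
    return "/".join(
        f"{f}={record[f] if record.get(f) is not None else HIVE_DEFAULT_PARTITION}"
        for f in fields
    )

fields = ["datasource", "year", "month"]
live = {"uuid": "a1", "datasource": "crm", "year": "2023", "month": "11"}
# A soft delete keeps the record key but nulls the other fields,
# including the partition-path fields.
soft_deleted = {"uuid": "a1", "datasource": None, "year": None, "month": None}

print(make_partition_path(live, fields))
# datasource=crm/year=2023/month=11
print(make_partition_path(soft_deleted, fields))
# datasource=__HIVE_DEFAULT_PARTITION__/year=__HIVE_DEFAULT_PARTITION__/month=__HIVE_DEFAULT_PARTITION__
```

With a global index, the record's key still maps to its old file group, which is why the pre-delete version can surface in snapshot reads until compaction reconciles the log files.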
Thanks again for all your help.