noahtaite commented on issue #10172:
URL: https://github.com/apache/hudi/issues/10172#issuecomment-2115213114
Apologies again for the delay. We shelved clustering after this experiment
and re-generated our lake with proper file sizing.
Since this issue could affect others, I'll share the configs that got us
there:
Upsert config:
```json
{
"hoodie.datasource.hive_sync.database": "db",
"hoodie.global.simple.index.parallelism": "1920",
"hoodie.datasource.hive_sync.mode": "hms",
"hoodie.datasource.hive_sync.support_timestamp": "true",
"hoodie.schema.on.read.enable": "false",
"path": "s3://bucket/table.all_hudi",
"hoodie.datasource.write.precombine.field": "CaptureDate",
"hoodie.datasource.hive_sync.partition_fields": "datasource,year,month",
"hoodie.datasource.write.payload.class":
"org.apache.hudi.common.model.OverwriteWithLatestAvroPayload",
"hoodie.datasource.hive_sync.use_jdbc": "false",
"hoodie.meta.sync.metadata_file_listing": "true",
"hoodie.cleaner.parallelism": "1920",
"hoodie.datasource.meta.sync.enable": "true",
"hoodie.datasource.hive_sync.skip_ro_suffix": "true",
"hoodie.metadata.enable": "true",
"hoodie.datasource.hive_sync.table": "table_all",
"hoodie.datasource.meta_sync.condition.sync": "true",
"hoodie.index.type": "GLOBAL_BLOOM",
"hoodie.clean.automatic": "true",
"hoodie.datasource.write.operation": "upsert",
"hoodie.datasource.hive_sync.enable": "true",
"hoodie.datasource.write.recordkey.field": "uuid",
"hoodie.table.name": "table_all",
"hoodie.write.lock.dynamodb.billing_mode": "PAY_PER_REQUEST",
"hoodie.datasource.write.table.type": "MERGE_ON_READ",
"hoodie.datasource.write.hive_style_partitioning": "true",
"hoodie.write.lock.dynamodb.endpoint_url": "*********(redacted)",
"hoodie.simple.index.parallelism": "1920",
"hoodie.write.lock.dynamodb.partition_key": "table_all",
"hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",
"hoodie.write.concurrency.early.conflict.detection.enable": "true",
"hoodie.compact.inline": "true",
"hoodie.datasource.write.reconcile.schema": "true",
"hoodie.datasource.write.keygenerator.class":
"org.apache.hudi.keygen.ComplexKeyGenerator",
"hoodie.cleaner.policy.failed.writes": "LAZY",
"hoodie.keep.max.commits": "110",
"hoodie.upsert.shuffle.parallelism": "1920",
"hoodie.meta.sync.client.tool.class":
"org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool",
"hoodie.cleaner.commits.retained": "90",
"hoodie.write.lock.dynamodb.table": "hudi-lock-provider",
"hoodie.write.lock.provider":
"org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider",
"hoodie.keep.min.commits": "100",
"hoodie.datasource.write.partitionpath.field": "datasource,year,month",
"hoodie.write.concurrency.mode": "OPTIMISTIC_CONCURRENCY_CONTROL",
"hoodie.write.lock.dynamodb.region": "us-east-1"
}
```
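For context, an options map like the one above gets handed to the Hudi Spark datasource writer. The sketch below is a hedged illustration, not our actual job: the `df` argument, the session setup, and the trimmed-down option subset are placeholders.

```python
# Minimal sketch: feeding the upsert options above into the Hudi
# Spark datasource writer. Only a representative subset of keys is shown.
hudi_options = {
    "hoodie.table.name": "table_all",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.recordkey.field": "uuid",
    "hoodie.datasource.write.precombine.field": "CaptureDate",
    "hoodie.datasource.write.partitionpath.field": "datasource,year,month",
    "hoodie.index.type": "GLOBAL_BLOOM",
}

def upsert(df, path="s3://bucket/table.all_hudi"):
    """Apply the options map to a DataFrame write.

    Requires a live SparkSession with the Hudi bundle on the classpath;
    shown here only to make the wiring concrete.
    """
    (df.write.format("hudi")
        .options(**hudi_options)
        .mode("append")
        .save(path))
```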
Clustering properties:
```properties
hoodie.clustering.async.enabled=true
hoodie.clustering.async.max.commits=1
hoodie.clustering.execution.strategy.class=org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy
hoodie.clustering.plan.strategy.target.file.max.bytes=524288000
hoodie.clustering.plan.strategy.small.file.limit=10485760
hoodie.clustering.preserve.commit.metadata=true
```
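In case the raw byte values above look opaque: the clustering target file size is 500 MiB and the small-file limit is 10 MiB, which a quick check confirms:

```python
MiB = 1024 * 1024

# Values from the clustering properties above, expressed in MiB.
target_file_max_bytes = 500 * MiB  # hoodie.clustering.plan.strategy.target.file.max.bytes
small_file_limit = 10 * MiB        # hoodie.clustering.plan.strategy.small.file.limit

assert target_file_max_bytes == 524288000
assert small_file_limit == 10485760
```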
Clustering job:
```sh
spark-submit \
  --class org.apache.hudi.utilities.HoodieClusteringJob \
  /usr/lib/hudi/hudi-utilities-bundle.jar \
  --props s3://bucket/properties/nt.clustering.properties \
  --mode scheduleAndExecute \
  --base-path s3://bucket/table.all_hudi/ \
  --table-name table_all \
  --spark-memory 90g \
  --parallelism 1000
```
From what I could gather, it appears that applying soft deletes moves
records to `__HIVE_DEFAULT_PARTITION__` (since nulling the record's fields
also nulls its partition-path fields), and with a global index the old
version of those records can still be visible in a snapshot query until
compaction is run. I observed this on Hudi 0.12.1 (AWS EMR 6.9.0). I don't
currently have the bandwidth to reproduce it on our latest stable Hudi
0.13.1 (AWS EMR 6.12.0) job.
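To make the `__HIVE_DEFAULT_PARTITION__` behavior concrete, here is a minimal sketch of how a hive-style partition path resolves when a soft delete nulls out the partition fields. `make_partition_path` is a hypothetical helper mimicking the observed behavior, not Hudi's actual key generator:

```python
HIVE_DEFAULT_PARTITION = "__HIVE_DEFAULT_PARTITION__"

def make_partition_path(record, fields):
    """Build a hive-style partition path; null partition fields fall
    back to the Hive default-partition sentinel."""
    return "/".join(
        f"{f}={record[f] if record.get(f) is not None else HIVE_DEFAULT_PARTITION}"
        for f in fields
    )

fields = ["datasource", "year", "month"]
live = {"uuid": "a1", "datasource": "crm", "year": "2023", "month": "11"}
# A soft delete keeps the record key but nulls the other fields,
# including the partition-path fields.
soft_deleted = {"uuid": "a1", "datasource": None, "year": None, "month": None}

print(make_partition_path(live, fields))
# datasource=crm/year=2023/month=11
print(make_partition_path(soft_deleted, fields))
# datasource=__HIVE_DEFAULT_PARTITION__/year=__HIVE_DEFAULT_PARTITION__/month=__HIVE_DEFAULT_PARTITION__
```

With a global index, the record's key still maps to its old file group, which is why the pre-delete version can surface in snapshot reads until compaction reconciles the log files.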
Thanks again for all your help.