joeytman commented on issue #9971:
URL: https://github.com/apache/hudi/issues/9971#issuecomment-1791307926
> It looks like your Flink simple hashing index does not really take effect,
> is there any chance you can share the Flink options with us so that we might
> find more clues about the unexpected discrepancy.
@danny0405 In the Flink logs we do see:
> `2023-11-01 22:16:11,025 INFO org.apache.hudi.index.bucket.HoodieBucketIndex [] - Use bucket index, numBuckets = 113, indexFields: [redacted1, redacted2]`
So I believe the bucket index is actually working, but I'm still happy to
provide the configuration.
Here are the Hudi args passed to the Flink job via CLI:
```
hoodie.metrics.on=true
hoodie.metrics.reporter.type=PROMETHEUS_PUSHGATEWAY
hoodie.metrics.pushgateway.host=localhost
hoodie.metrics.pushgateway.port=7075
hoodie.metrics.pushgateway.delete.on.shutdown=false
hoodie.metrics.pushgateway.random.job.name.suffix=false
hoodie.metrics.pushgateway.job.name=redacted
hoodie.datasource.write.schema.allow.auto.evolution.column.drop=true
hoodie.index.type=BUCKET
hoodie.bucket.index.num.buckets=113
```
And some others that are hardcoded in our Flink jar:
```
var builder =
    HoodiePipeline.builder(config.hudiTable)
        .schema(toFlinkSchema(dataType))
        .option(FlinkOptions.PATH, config.hudiBasePath)
        .option(FlinkOptions.OPERATION, WriteOperationType.UPSERT)
        .option(FlinkOptions.TABLE_TYPE, "MERGE_ON_READ")
        .option(FlinkOptions.PRE_COMBINE, true)
        .option(
            FlinkOptions.PRECOMBINE_FIELD,
            DebeziumDeserializationSchema.MetadataField.TS.name)
        .option(FlinkOptions.CDC_ENABLED, true)
        .option(HoodieWriteConfig.AVRO_EXTERNAL_SCHEMA_TRANSFORMATION_ENABLE.key(), true)
        .option(FlinkOptions.HIVE_SYNC_ENABLED, config.hiveSync)
        .option(FlinkOptions.HIVE_SYNC_DB, config.hudiDb)
        .options(config.hudiParams);
```
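The `key=value` Hudi args passed via CLI presumably reach the builder through the final `.options(config.hudiParams)` call. A minimal sketch of that conversion, assuming a hypothetical `HudiArgs.parse` helper (illustrative only: it is not part of Hudi or Flink, and not the actual code in this job):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Hudi or Flink: turns "key=value" CLI
// arguments into the kind of map that could be passed to builder.options(...).
final class HudiArgs {
    static Map<String, String> parse(String... args) {
        // LinkedHashMap keeps the options in the order they were given.
        Map<String, String> params = new LinkedHashMap<>();
        for (String arg : args) {
            int eq = arg.indexOf('=');
            if (eq <= 0) {
                throw new IllegalArgumentException("Expected key=value, got: " + arg);
            }
            // Split on the first '=' only, so values may themselves contain '='.
            params.put(arg.substring(0, eq).trim(), arg.substring(eq + 1).trim());
        }
        return params;
    }

    public static void main(String[] args) {
        Map<String, String> params = parse(
            "hoodie.index.type=BUCKET",
            "hoodie.bucket.index.num.buckets=113");
        params.forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```

Because `.options(...)` is the last call in the chain, a map built this way is applied after the hardcoded `.option(...)` calls; whether it overrides them for a duplicated key depends on the builder and is not verified here.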
And the Flink options:
```
-Dexecution.checkpointing.interval=15m
-Dexecution.checkpointing.min-pause=15m
-Dstate.checkpoints.dir=s3p://<redacted>/
-Dtaskmanager.memory.process.size=28g
-Djobmanager.memory.process.size=6g
-Dyarn.application.name=<redacted>
-Dyarn.tags=<redacted>
-Dtaskmanager.memory.framework.off-heap.size=4G
-Dtaskmanager.memory.task.off-heap.size=4G
-Dexecution.checkpointing.timeout=100m
-Dtaskmanager.memory.managed.fraction=0.35
-Dparallelism.default=4
```