joeytman commented on issue #9971:
URL: https://github.com/apache/hudi/issues/9971#issuecomment-1791307926

   > It looks like your Flink simple hashing index does not really take effect,
is there any chance you can share the Flink options with us so that we might
find more clues about the unexpected discrepancy.
   
   @danny0405 In the Flink logs we do see:
   > `2023-11-01 22:16:11,025 INFO  org.apache.hudi.index.bucket.HoodieBucketIndex               [] - Use bucket index, numBuckets = 113, indexFields: [redacted1, redacted2]`
   
   So I believe the bucket index is actually in effect, but I'm still happy to
provide the configuration.
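
   For context on what that log line implies, here is a minimal sketch of how a
hash-based bucket index can map a record's index fields to one of `numBuckets`
buckets. The class and method names are hypothetical, and the hash function is
an assumption for illustration; Hudi's own implementation may hash the key
fields differently.

   ```java
   import java.util.Arrays;
   import java.util.List;

   // Hypothetical illustration only; not Hudi's actual class.
   class BucketIdSketch {
       // Maps the index field values of one record to a bucket in [0, numBuckets).
       static int bucketId(List<String> indexFieldValues, int numBuckets) {
           // Mask the sign bit so the hash is non-negative before taking the modulus.
           return (indexFieldValues.hashCode() & Integer.MAX_VALUE) % numBuckets;
       }

       public static void main(String[] args) {
           int numBuckets = 113; // value reported in the log line above
           // Placeholder values standing in for the two redacted index fields.
           List<String> key = Arrays.asList("value1", "value2");
           System.out.println("bucket = " + bucketId(key, numBuckets));
       }
   }
   ```

   The point of the sketch is that the same key always lands in the same
bucket, so an upsert only needs to look at one file group.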
   
   Here are the Hudi args passed to the Flink job via CLI:
   
   ```
    hoodie.metrics.on=true
    hoodie.metrics.reporter.type=PROMETHEUS_PUSHGATEWAY
    hoodie.metrics.pushgateway.host=localhost
    hoodie.metrics.pushgateway.port=7075
    hoodie.metrics.pushgateway.delete.on.shutdown=false
    hoodie.metrics.pushgateway.random.job.name.suffix=false
    hoodie.metrics.pushgateway.job.name=redacted
    hoodie.datasource.write.schema.allow.auto.evolution.column.drop=true
    hoodie.index.type=BUCKET
    hoodie.bucket.index.num.buckets=113
   ```
   And some others that are hardcoded in our Flink jar:
   ```
   
       var builder =
           HoodiePipeline.builder(config.hudiTable)
               .schema(toFlinkSchema(dataType))
               .option(FlinkOptions.PATH, config.hudiBasePath)
               .option(FlinkOptions.OPERATION, WriteOperationType.UPSERT)
               .option(FlinkOptions.TABLE_TYPE, "MERGE_ON_READ")
               .option(FlinkOptions.PRE_COMBINE, true)
               .option(
                   FlinkOptions.PRECOMBINE_FIELD,
                   DebeziumDeserializationSchema.MetadataField.TS.name)
               .option(FlinkOptions.CDC_ENABLED, true)
               .option(HoodieWriteConfig.AVRO_EXTERNAL_SCHEMA_TRANSFORMATION_ENABLE.key(), true)
               .option(FlinkOptions.HIVE_SYNC_ENABLED, config.hiveSync)
               .option(FlinkOptions.HIVE_SYNC_DB, config.hudiDb)
               .options(config.hudiParams);
   ```
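
   To show how the CLI args and the hardcoded options could fit together, here
is a minimal sketch of turning `key=value` arguments into the map that is
passed to `.options(...)`. The class name `HudiParamsSketch` and the `parse`
helper are hypothetical; how `config.hudiParams` is actually built is not shown
in this comment.

   ```java
   import java.util.LinkedHashMap;
   import java.util.Map;

   // Hypothetical helper; the real job may build config.hudiParams differently.
   class HudiParamsSketch {
       // Splits each "key=value" argument at the first '=' and keeps insertion order.
       static Map<String, String> parse(String[] args) {
           Map<String, String> params = new LinkedHashMap<>();
           for (String arg : args) {
               int eq = arg.indexOf('=');
               if (eq <= 0) {
                   throw new IllegalArgumentException("Expected key=value but got: " + arg);
               }
               params.put(arg.substring(0, eq).trim(), arg.substring(eq + 1).trim());
           }
           return params;
       }

       public static void main(String[] args) {
           Map<String, String> params = parse(new String[] {
               "hoodie.index.type=BUCKET",
               "hoodie.bucket.index.num.buckets=113"
           });
           // Print each parsed option, one per line.
           params.forEach((k, v) -> System.out.println(k + " -> " + v));
       }
   }
   ```

   Because `.options(config.hudiParams)` is the last call in the builder chain,
any key present in both places would presumably take the CLI value.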
   
   And the Flink options:
   ```
   
    -Dexecution.checkpointing.interval=15m
    -Dexecution.checkpointing.min-pause=15m
    -Dstate.checkpoints.dir=s3p://<redacted>/
    -Dtaskmanager.memory.process.size=28g
    -Djobmanager.memory.process.size=6g
    -Dyarn.application.name=<redacted>
    -Dyarn.tags=<redacted>
    -Dtaskmanager.memory.framework.off-heap.size=4G
    -Dtaskmanager.memory.task.off-heap.size=4G
    -Dexecution.checkpointing.timeout=100m
    -Dtaskmanager.memory.managed.fraction=0.35
    -Dparallelism.default=4 
   ```

