jiangjin-f commented on issue #5911:
URL: https://github.com/apache/hudi/issues/5911#issuecomment-1161329317
> > > flink engine uses a state-backend to store the index by default, for
DeltaStreamer did you use the COW table type ?
> >
> >
> > use the COW table type . If used **--index.bootstrap.enabled=true** ,
need to set the -**-index.state.ttl=0.2** when there is a lot of data? if the
hudi table existed .Can this parameter ensure that the data can be updated?
(**--index.bootstrap.enabled=true**) delta streamer init parquet file , then
flink incremental kafka data. data not updated.
> > delta streamer write configs. `spark-submit \ --packages
org.apache.spark:spark-avro_2.11:2.4.4 \ --conf
'spark.serializer=org.apache.spark.serializer.KryoSerializer' \ --conf
spark.default.parallelism=400 --num-executors 100 --executor-cores 4
--executor-memory 16G \ --conf spark.dynamicAllocation.enabled=false \ --conf
spark.yarn.heterogeneousExecutors.enabled=false \ --class
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer
s3://****/0.10/hudi-utilities-bundle_2.11-0.10.0.jar \ **--table-type
COPY_ON_WRITE \** **--source-class
org.apache.hudi.utilities.sources.ParquetDFSSource \** --source-ordering-field
last_update_time \ --target-base-path s3://********/tablename \ --target-table
tablename \ --hoodie-conf hoodie.datasource.write.recordkey.field=primary_id \
--hoodie-conf hoodie.datasource.write.partitionpath.field=dt \ --hoodie-conf
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.SimpleKeyGenerator
\ --hoodie-conf hoodie.datasource.write.hive_s
tyle_partitioning=true \ --hoodie-conf hoodie.delete.shuffle.parallelism=400 \
--hoodie-conf hoodie.upsert.shuffle.parallelism=400 \ --hoodie-conf
hoodie.bulkinsert.shuffle.parallelism=400 \ --hoodie-conf
hoodie.insert.shuffle.parallelism=400 \ --hoodie-conf
hoodie.datasource.write.precombine.field=last_update_time \ --hoodie-conf
hoodie.base.path = s3://********/tablename \ --hoodie-conf
hoodie.deltastreamer.schemaprovider.source.schema.file=s3://*****/source_schema.avsc
\ --hoodie-conf
hoodie.deltastreamer.schemaprovider.target.schema.file=s3://*****/target_schema.avsc
\ --hoodie-conf hoodie.datasource.write.operation=bulk_insert \ --hoodie-conf
hoodie.datasource.hive_sync.database=dw \ --hoodie-conf
hoodie.datasource.hive_sync.table=tablename \ --hoodie-conf
hoodie.datasource.hive_sync.partition_fields=dt \ --hoodie-conf
hoodie.datasource.hive_sync.assume_date_partitioning=false \ --hoodie-conf
hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKe
ysValueExtractor \ --hoodie-conf
hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://*******:10000 \ --hoodie-conf
hoodie.deltastreamer.checkpoint.provider.path=s3://*****/checkpoint/ \
**--hoodie-conf
hoodie.deltastreamer.source.dfs.root=s3://*****/dw.db/*******_parquet \**
--enable-hive-sync \`
>
> If you need to update data from a period of time ago, please set
index.state.ttl large than this time. `--index.bootstrap.enabled=true` can load
index from parquet and update it. Can you find this log for every task in
taskmanager?
>
> ```
> LOG.info("Task [{}}:{}}] finish loading the index under partition {} and
sending them to downstream, time cost: {} milliseconds.",
this.getClass().getSimpleName(), taskID, partitionPath, cost);
> ```
ok 。i know .i will try,then tell you test results. Thank you very much。
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]