[GitHub] [hudi] yuzhaojing commented on issue #5911: [SUPPORT] delta streamer init Parquet file then flink incremental data , Data not updated

GitBox Mon, 20 Jun 2022 23:30:31 -0700


yuzhaojing commented on issue #5911:
URL: https://github.com/apache/hudi/issues/5911#issuecomment-1161319317


   
   
   
   > > The Flink engine uses a state backend to store the index by default. For 
DeltaStreamer, did you use the COW table type?
   > 
   > I use the COW table type. If **--index.bootstrap.enabled=true** is set, do I 
also need to set **--index.state.ttl=0.2** when there is a lot of data? If the 
Hudi table already exists, can this parameter ensure that the data gets updated? 
With **--index.bootstrap.enabled=true**, DeltaStreamer initialized the table from 
Parquet files, then Flink ingested incremental Kafka data, but the data was not updated.
   > 
   > DeltaStreamer write configs: `spark-submit \ --packages 
org.apache.spark:spark-avro_2.11:2.4.4 \ --conf 
'spark.serializer=org.apache.spark.serializer.KryoSerializer' \ --conf 
spark.default.parallelism=400 --num-executors 100 --executor-cores 4 
--executor-memory 16G \ --conf spark.dynamicAllocation.enabled=false \ --conf 
spark.yarn.heterogeneousExecutors.enabled=false \ --class 
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer 
s3://****/0.10/hudi-utilities-bundle_2.11-0.10.0.jar \ **--table-type 
COPY_ON_WRITE \** **--source-class 
org.apache.hudi.utilities.sources.ParquetDFSSource \** --source-ordering-field 
last_update_time \ --target-base-path s3://********/tablename \ --target-table 
tablename \ --hoodie-conf hoodie.datasource.write.recordkey.field=primary_id \ 
--hoodie-conf hoodie.datasource.write.partitionpath.field=dt \ --hoodie-conf 
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.SimpleKeyGenerator
 \ --hoodie-conf hoodie.datasource.write.hive_style_partitioning=true \ 
--hoodie-conf hoodie.delete.shuffle.parallelism=400 \ 
--hoodie-conf hoodie.upsert.shuffle.parallelism=400 \ --hoodie-conf 
hoodie.bulkinsert.shuffle.parallelism=400 \ --hoodie-conf 
hoodie.insert.shuffle.parallelism=400 \ --hoodie-conf 
hoodie.datasource.write.precombine.field=last_update_time \ --hoodie-conf 
hoodie.base.path = s3://********/tablename \ --hoodie-conf 
hoodie.deltastreamer.schemaprovider.source.schema.file=s3://*****/source_schema.avsc
 \ --hoodie-conf 
hoodie.deltastreamer.schemaprovider.target.schema.file=s3://*****/target_schema.avsc
 \ --hoodie-conf hoodie.datasource.write.operation=bulk_insert \ --hoodie-conf 
hoodie.datasource.hive_sync.database=dw \ --hoodie-conf 
hoodie.datasource.hive_sync.table=tablename \ --hoodie-conf 
hoodie.datasource.hive_sync.partition_fields=dt \ --hoodie-conf 
hoodie.datasource.hive_sync.assume_date_partitioning=false \ --hoodie-conf 
hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor
 \ --hoodie-conf 
hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://*******:10000 \ --hoodie-conf 
hoodie.deltastreamer.checkpoint.provider.path=s3://*****/checkpoint/ \ 
**--hoodie-conf 
hoodie.deltastreamer.source.dfs.root=s3://*****/dw.db/*******_parquet \** 
--enable-hive-sync \`
   
   If you need to update data that was written some time ago, please set 
`index.state.ttl` larger than that period.
   `--index.bootstrap.enabled=true` can load the index from the existing Parquet 
files and update it. Can you find this log for every task in the TaskManager?
   ```
   LOG.info("Task [{}}:{}}] finish loading the index under partition {} and 
sending them to downstream, time cost: {} milliseconds.", 
this.getClass().getSimpleName(), taskID, partitionPath, cost);
   ```
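
One way to check this across all tasks is to scan the TaskManager logs for the message above. The following is only a minimal sketch: it assumes each message is printed on a single line, that task IDs are zero-based subtask indices, and that you know the operator parallelism. The helper names and the sample log lines are hypothetical, not part of Hudi.

```python
import re
from collections import defaultdict

# Matches the rendered form of the log statement quoted above. The "\}?" parts
# tolerate the stray "}" that the "{}}" placeholders leave in the output.
PATTERN = re.compile(
    r"Task \[(?P<op>\w+)\}?:(?P<task>\d+)\}?\] finish loading the index "
    r"under partition (?P<partition>.+?) and sending them to downstream, "
    r"time cost: (?P<cost>\d+) milliseconds"
)


def loaded_partitions_by_task(lines):
    """Map task ID -> list of (partition path, cost in ms) found in the lines."""
    result = defaultdict(list)
    for line in lines:
        match = PATTERN.search(line)
        if match:
            result[int(match.group("task"))].append(
                (match.group("partition"), int(match.group("cost")))
            )
    return dict(result)


def missing_tasks(lines, parallelism):
    """Return task IDs in [0, parallelism) that never logged the message.

    Assumption: task IDs run from 0 to parallelism - 1.
    """
    seen = loaded_partitions_by_task(lines)
    return sorted(set(range(parallelism)) - set(seen))


# Hypothetical sample lines, for illustration only.
sample = [
    "INFO Task [BootstrapOperator}:0}] finish loading the index under partition "
    "dt=2022-06-01 and sending them to downstream, time cost: 1200 milliseconds.",
    "INFO Task [BootstrapOperator}:1}] finish loading the index under partition "
    "dt=2022-06-01 and sending them to downstream, time cost: 900 milliseconds.",
    "INFO some unrelated log line",
]
print(missing_tasks(sample, parallelism=3))
```

Any task ID reported as missing would be a task whose log should be inspected by hand.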
   
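
To pick a value for `index.state.ttl`, a rough calculation like the one below may help. It is a sketch that assumes the option is expressed in days (as a decimal number); the function name and the safety margin are hypothetical. Under that assumption, the `0.2` from the quoted question would cover only about 4.8 hours.

```python
from datetime import datetime, timedelta


def min_index_state_ttl_days(oldest_update, now, margin_days=1.0):
    """Smallest TTL (in days) that still covers records last written at `oldest_update`.

    Assumption: index.state.ttl is measured in days. `margin_days` is an
    arbitrary safety buffer, not a Hudi setting.
    """
    age = now - oldest_update
    if age < timedelta(0):
        raise ValueError("oldest_update must not be later than now")
    return age / timedelta(days=1) + margin_days


# Records written on 2022-06-01 that may still be updated on 2022-06-21.
ttl = min_index_state_ttl_days(datetime(2022, 6, 1), datetime(2022, 6, 21))
print(f"--index.state.ttl={ttl}")
```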
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
