[GitHub] [hudi] jiangjin-f commented on issue #5911: [SUPPORT] delta streamer init Parquet file then flink incremental data , Data not updated

GitBox Mon, 20 Jun 2022 23:01:44 -0700


jiangjin-f commented on issue #5911:
URL: https://github.com/apache/hudi/issues/5911#issuecomment-1161300973


   > flink engine uses a state-backend to store the index by default, for DeltaStreamer did you use the COW table type ?
   
   Yes, the table uses the COW table type.
   If **--index.bootstrap.enabled=true** is used, do I need to set **--index.state.ttl=0.2** when there is a lot of data?
   If the Hudi table already exists, can this parameter (**--index.bootstrap.enabled=true**) ensure that the data gets updated?
   DeltaStreamer initializes the table from Parquet files, then Flink writes the incremental Kafka data, but the data is not updated.
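
   For reference, here is a minimal sketch of how the two index options asked about above might be set on the Flink side, assuming the Flink SQL connector is used. The column types and the output file name `hudi_sink.sql` are hypothetical, the path is the same masked placeholder used in the DeltaStreamer command in this comment, and the TTL value is simply the one from the question, not a recommendation. The snippet only writes the DDL to a file for review; it does not submit a job.

```shell
# Sketch only: write a Flink SQL table definition to a local file.
# Column types and the file name are assumptions for illustration.
cat > hudi_sink.sql <<'EOF'
CREATE TABLE tablename (
  primary_id STRING PRIMARY KEY NOT ENFORCED,  -- record key used by DeltaStreamer
  last_update_time TIMESTAMP(3),               -- ordering field used by DeltaStreamer
  dt STRING                                    -- partition field
) PARTITIONED BY (dt) WITH (
  'connector' = 'hudi',
  'path' = 's3://********/tablename',
  'table.type' = 'COPY_ON_WRITE',
  'index.bootstrap.enabled' = 'true',          -- load the existing table's index into Flink state
  'index.state.ttl' = '0.2'                    -- value taken from the question above
);
EOF
```

   Only the two index options are the point here; other write options would still have to match the DeltaStreamer settings.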
   
   DeltaStreamer write configs:
   `spark-submit \
     --packages org.apache.spark:spark-avro_2.11:2.4.4 \
     --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
     --conf spark.default.parallelism=400 --num-executors 100 --executor-cores 4 --executor-memory 16G \
     --conf spark.dynamicAllocation.enabled=false \
     --conf spark.yarn.heterogeneousExecutors.enabled=false \
     --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer s3://****/0.10/hudi-utilities-bundle_2.11-0.10.0.jar \
     --table-type COPY_ON_WRITE \
     --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
     --source-ordering-field last_update_time \
     --target-base-path s3://********/tablename \
     --target-table tablename \
     --hoodie-conf hoodie.datasource.write.recordkey.field=primary_id \
     --hoodie-conf hoodie.datasource.write.partitionpath.field=dt \
     --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.SimpleKeyGenerator \
     --hoodie-conf hoodie.datasource.write.hive_style_partitioning=true \
     --hoodie-conf hoodie.delete.shuffle.parallelism=400 \
     --hoodie-conf hoodie.upsert.shuffle.parallelism=400 \
     --hoodie-conf hoodie.bulkinsert.shuffle.parallelism=400 \
     --hoodie-conf hoodie.insert.shuffle.parallelism=400 \
     --hoodie-conf hoodie.datasource.write.precombine.field=last_update_time \
     --hoodie-conf hoodie.base.path=s3://********/tablename \
     --hoodie-conf hoodie.deltastreamer.schemaprovider.source.schema.file=s3://*****/source_schema.avsc \
     --hoodie-conf hoodie.deltastreamer.schemaprovider.target.schema.file=s3://*****/target_schema.avsc \
     --hoodie-conf hoodie.datasource.write.operation=bulk_insert \
     --hoodie-conf hoodie.datasource.hive_sync.database=dw \
     --hoodie-conf hoodie.datasource.hive_sync.table=tablename \
     --hoodie-conf hoodie.datasource.hive_sync.partition_fields=dt \
     --hoodie-conf hoodie.datasource.hive_sync.assume_date_partitioning=false \
     --hoodie-conf hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor \
     --hoodie-conf hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://*******:10000 \
     --hoodie-conf hoodie.deltastreamer.checkpoint.provider.path=s3://*****/checkpoint/ \
     --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://*****/dw.db/*******_parquet \
     --enable-hive-sync`
   


