[GitHub] [hudi] nsivabalan commented on issue #4796: [SUPPORT] Cannot run HoodieDeltaStreamer the second time when using async clustering

GitBox Sat, 12 Feb 2022 11:48:00 -0800


nsivabalan commented on issue #4796:
URL: https://github.com/apache/hudi/issues/4796#issuecomment-1037426901



   I could not reproduce this. I also tried with ComplexKeyGenerator, an empty 
partition path, and no schema provider configs, but still could not reproduce 
it. Sorry. We may need reproducible steps with a sample dataset, if feasible. 
   
   
   ```
   /bin/spark-submit \
     --packages org.apache.spark:spark-avro_2.11:2.4.4,org \
     --driver-memory 8g --executor-memory 8g \
     --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
     path_to_/hudi-utilities-bundle_2.11-0.10.1.jar \
     --props /tmp/parquet-dfs-cluster.props \
     --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
     --source-ordering-field created_at \
     --table-type COPY_ON_WRITE \
     --target-base-path file:///tmp/hudi-deltastreamer-gh1/ \
     --target-table gh_hudi_tbl31 \
     --op UPSERT \
     --hoodie-conf hoodie.clustering.async.enabled=true \
     --continuous \
     --source-limit 4000000 \
     --min-sync-interval-seconds 30
   ```
   
   Properties file contents:
   ```
   hoodie.datasource.write.recordkey.field=other,org.id
   hoodie.datasource.write.partitionpath.field=
   hoodie.datasource.write.precombine.field=created_at
   
   hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator
   
   hoodie.metadata.enable=false
   hoodie.upsert.shuffle.parallelism=8
   hoodie.insert.shuffle.parallelism=8
   hoodie.delete.shuffle.parallelism=8
   hoodie.bulkinsert.shuffle.parallelism=8
   
   hoodie.deltastreamer.source.dfs.root=/dataset_path/
   
   hoodie.clustering.plan.strategy.sort.columns=created_at
   hoodie.clustering.plan.strategy.daybased.lookback.partitions=0
   hoodie.clustering.async.max.commits=2
   ```
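
When comparing setups, it can help to list which clustering settings a properties file actually carries. The following is a minimal Python sketch, not part of Hudi: the helper name `parse_props` is hypothetical, and it assumes the file contains only plain `key=value` lines (no line continuations or escapes). Note that `hoodie.clustering.async.enabled=true` is passed via `--hoodie-conf` on the command line above, so it does not appear in the file.

```python
def parse_props(text):
    """Parse simple key=value lines; blank lines and '#' comments are skipped.

    Hypothetical helper for illustration; it does not handle the full
    Java properties syntax (continuations, escapes, ':' separators).
    """
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


# Contents copied from the properties file shown above.
PROPS_TEXT = """
hoodie.datasource.write.recordkey.field=other,org.id
hoodie.datasource.write.partitionpath.field=
hoodie.datasource.write.precombine.field=created_at
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator
hoodie.metadata.enable=false
hoodie.upsert.shuffle.parallelism=8
hoodie.insert.shuffle.parallelism=8
hoodie.delete.shuffle.parallelism=8
hoodie.bulkinsert.shuffle.parallelism=8
hoodie.deltastreamer.source.dfs.root=/dataset_path/
hoodie.clustering.plan.strategy.sort.columns=created_at
hoodie.clustering.plan.strategy.daybased.lookback.partitions=0
hoodie.clustering.async.max.commits=2
"""

props = parse_props(PROPS_TEXT)

# Keep only the clustering-related keys for a side-by-side comparison.
clustering = {k: v for k, v in props.items() if k.startswith("hoodie.clustering.")}
for key in sorted(clustering):
    print(f"{key}={clustering[key]}")
```

Running the same script against the properties file from a failing setup would make any difference in clustering configuration easy to spot.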
   
   
   
   
   


