rohit-m-99 opened a new issue #5050:
URL: https://github.com/apache/hudi/issues/5050


   **Describe the problem you faced**
   
The DeltaStreamer requires a significant amount of resources and struggles to delete file markers during clustering. The screenshot below shows clustering taking over 3 hours to run. It also causes many pods to be evicted by requiring more storage than is available.
   
   <img width="1435" alt="image" src="https://user-images.githubusercontent.com/84733594/158526765-c5d31bd5-367a-4e6e-b929-09c2c2297468.png">
   
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Have a large number of S3 files
   2. Run deltastreamer script below
   
   **Expected behavior**
   
   Deltastreamer updates should happen continuously in continuous mode.
   
   **Environment Description**
   
   * Hudi version : 0.10.1
   * Spark version : 3.0.3
   * Hadoop version : 3.2.0
   * Storage (HDFS/S3/GCS..) : S3
   * Running on Docker? (yes/no) : Yes
   
   **Additional context**
   
   Spark Submit Job:
   
   ```
   spark-submit \
   --jars /opt/spark/jars/hudi-spark3-bundle.jar,/opt/spark/jars/hadoop-aws.jar,/opt/spark/jars/aws-java-sdk.jar,/opt/spark/jars/spark-avro.jar \
   --master spark://spark-master:7077 \
   --driver-memory 4g \
   --executor-memory 4g \
   --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
   --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer /opt/spark/jars/hudi-utilities-bundle.jar \
   --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
   --target-table per_tick_stats \
   --table-type COPY_ON_WRITE \
   --continuous \
   --source-ordering-field STATOVYGIYLUMVSF6YLU \
   --target-base-path s3a://simian-example-prod-output/stats/querying \
   --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3a://simian-example-prod-output/stats/ingesting \
   --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator \
   --hoodie-conf hoodie.datasource.write.recordkey.field=STATONUW25LMMF2GS33OL5ZHK3S7NFSA____,STATONUW2X3UNFWWK___ \
   --hoodie-conf hoodie.datasource.write.precombine.field=STATOVYGIYLUMVSF6YLU \
   --hoodie-conf hoodie.clustering.plan.strategy.sort.columns=STATONUW25LMMF2GS33OL5ZHK3S7NFSA____,STATMJQXIY3IL5ZHK3S7NFSA____ \
   --hoodie-conf hoodie.clustering.inline=true \
   --hoodie-conf hoodie.clustering.inline.max.commits=4 \
   --hoodie-conf hoodie.datasource.write.partitionpath.field=
   ```
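
   Slow marker deletion on S3 is typically a symptom of direct, file-by-file marker management, and inline clustering blocks the ingestion loop in `--continuous` mode. A possible mitigation, sketched as extra `--hoodie-conf` flags (assuming Hudi 0.9.0+ where timeline-server-based markers and async clustering are available — not verified against this exact deployment):

   ```
   # Sketch: batch marker create/delete through the embedded timeline server
   # instead of issuing one S3 request per marker file
   --hoodie-conf hoodie.write.markers.type=TIMELINE_SERVER_BASED \
   # Sketch: run clustering asynchronously so the continuous ingestion
   # loop is not blocked for hours by an inline clustering pass
   --hoodie-conf hoodie.clustering.inline=false \
   --hoodie-conf hoodie.clustering.async.enabled=true
   ```

   Whether async clustering is appropriate depends on the table's concurrency-control setup, so treat this as a starting point rather than a drop-in fix.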
   
   **Stacktrace**
   
   No errors, just taking a lot of time.
