rohit-m-99 opened a new issue #5050: URL: https://github.com/apache/hudi/issues/5050
**Describe the problem you faced**

The deltastreamer requires a significant amount of resources and is struggling to delete file markers during clustering. The image below shows clustering taking over 3 hours to run. It also causes many pods to be evicted by requiring more storage than is available.

<img width="1435" alt="image" src="https://user-images.githubusercontent.com/84733594/158526765-c5d31bd5-367a-4e6e-b929-09c2c2297468.png">

**To Reproduce**

Steps to reproduce the behavior:

1. Have a large number of S3 files
2. Run the deltastreamer spark-submit job below

**Expected behavior**

Deltastreamer updates should happen continuously in continuous mode.

**Environment Description**

* Hudi version : 10.1
* Spark version : 3.0.3
* Hadoop version : 3.2.0
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : Yes

**Additional context**

Spark Submit Job:

```
spark-submit \
  --jars /opt/spark/jars/hudi-spark3-bundle.jar,/opt/spark/jars/hadoop-aws.jar,/opt/spark/jars/aws-java-sdk.jar,/opt/spark/jars/spark-avro.jar \
  --master spark://spark-master:7077 \
  --driver-memory 4g \
  --executor-memory 4g \
  --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer opt/spark/jars/hudi-utilities-bundle.jar \
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
  --target-table per_tick_stats \
  --table-type COPY_ON_WRITE \
  --continuous \
  --source-ordering-field STATOVYGIYLUMVSF6YLU \
  --target-base-path s3a://simian-example-prod-output/stats/querying \
  --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3a://simian-example-prod-output/stats/ingesting \
  --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator \
  --hoodie-conf hoodie.datasource.write.recordkey.field=STATONUW25LMMF2GS33OL5ZHK3S7NFSA____,STATONUW2X3UNFWWK___ \
  --hoodie-conf hoodie.datasource.write.precombine.field=STATOVYGIYLUMVSF6YLU \
  --hoodie-conf hoodie.clustering.plan.strategy.sort.columns=STATONUW25LMMF2GS33OL5ZHK3S7NFSA____,STATMJQXIY3IL5ZHK3S7NFSA____ \
  --hoodie-conf hoodie.clustering.inline=true \
  --hoodie-conf hoodie.clustering.inline.max.commits=4 \
  --hoodie-conf hoodie.datasource.write.partitionpath.field=
```

**Stacktrace**

No errors, it is just taking a lot of time.
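Since the slowness is reported around deleting file markers on S3, one possible mitigation to try (an untested assumption on my part, not a confirmed fix for this setup) is switching marker handling from direct files to the timeline-server-based mechanism that Hudi provides for cloud object stores, which avoids creating and deleting one marker object per data file on S3:

```shell
# Assumption / sketch: add this flag to the spark-submit job above.
# TIMELINE_SERVER_BASED routes marker creation and deletion through the
# embedded timeline server instead of individual S3 marker files.
--hoodie-conf hoodie.write.markers.type=TIMELINE_SERVER_BASED
```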
