rohit-m-99 opened a new issue #5037:
URL: https://github.com/apache/hudi/issues/5037


   **Describe the problem you faced**
   
   The deltastreamer fails to pick up any updates from the source folder when there are many new files, even though it is running in continuous mode. No jobs are shown as running in the deltastreamer UI. This happens even when writing to a brand-new table.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Have a large number of files in the S3 source folder -> for us the problem occurs at around 10k files (a sketch for generating such files is shown after this list)
   2. Run the deltastreamer spark-submit command below
   3. No updates are found
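
   For reference, a minimal sketch of step 1 (the bucket path and schema are placeholders, not our actual pipeline) that writes roughly 10k small parquet files under the deltastreamer's DFS source root:

   ```python
   # Hypothetical sketch: produce ~10k small parquet files under the DFS source root
   # so the path selector has to consider a large file listing.
   from pyspark.sql import SparkSession

   spark = SparkSession.builder.appName("generate-source-files").getOrCreate()

   SOURCE_ROOT = "s3a://example-bucket/stats/ingesting"  # placeholder, not our real bucket

   # 10_000 partitions -> roughly 10k parquet part files under the source root
   df = spark.range(0, 10_000_000).toDF("record_key")
   df.repartition(10_000).write.mode("append").parquet(SOURCE_ROOT)
   ```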
   
   **Expected behavior**
   
   Deltastreamer updates should happen continuously in continuous mode.
   
   **Environment Description**
   
   * Hudi version : 0.10.1
   * Spark version : 3.0.3
   * Hadoop version : 3.2.0
   * Storage (HDFS/S3/GCS..) : S3
   * Running on Docker? (yes/no) : Yes
   
   **Additional context**
   
   Spark Submit Job:
   
   ```
   spark-submit \
   --jars /opt/spark/jars/hudi-spark3-bundle.jar,/opt/spark/jars/hadoop-aws.jar,/opt/spark/jars/aws-java-sdk.jar,/opt/spark/jars/spark-avro.jar \
   --master spark://spark-master:7077 \
   --total-executor-cores 10 \
   --driver-memory 4g \
   --executor-memory 4g \
   --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
   --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer opt/spark/jars/hudi-utilities-bundle.jar \
   --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
   --target-table per_tick_stats \
   --table-type COPY_ON_WRITE \
   --continuous \
   --source-ordering-field STATOVYGIYLUMVSF6YLU \
   --target-base-path s3a://simian-kodiak-prod-output/stats/querying \
   --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3a://simian-kodiak-prod-output/stats/ingesting \
   --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator \
   --hoodie-conf hoodie.datasource.write.recordkey.field=STATONUW25LMMF2GS33OL5ZHK3S7NFSA____,STATONUW2X3UNFWWK___ \
   --hoodie-conf hoodie.datasource.write.precombine.field=STATOVYGIYLUMVSF6YLU \
   --hoodie-conf hoodie.clustering.plan.strategy.sort.columns=STATONUW25LMMF2GS33OL5ZHK3S7NFSA____,STATMJQXIY3IL5ZHK3S7NFSA____ \
   --hoodie-conf hoodie.clustering.inline=true \
   --hoodie-conf hoodie.clustering.inline.max.commits=4 \
   --hoodie-conf hoodie.datasource.write.partitionpath.field= \
   --source-limit 2147483648
   ```
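
   To double-check step 3 ("No updates are found"), we read the target table back and look at the newest commit time. A minimal sketch, assuming the Hudi Spark bundle is on the session classpath and using a placeholder table path:

   ```python
   # Hypothetical check: if max(_hoodie_commit_time) stops advancing while the
   # continuous job is "running", no new commits are landing on the target table.
   from pyspark.sql import SparkSession, functions as F

   spark = SparkSession.builder.appName("check-latest-commit").getOrCreate()

   TARGET_PATH = "s3a://example-bucket/stats/querying"  # placeholder, not our real bucket

   df = spark.read.format("hudi").load(TARGET_PATH)
   df.select(F.max("_hoodie_commit_time").alias("latest_commit")).show(truncate=False)
   ```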
   
   **Stacktrace**
   
   There is no further log output and no errors. The last log lines are below:
   
   ```
   22/03/11 03:55:32 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from s3a://simian-customer-prod-output/stats/querying
   22/03/11 03:55:32 INFO HoodieTableConfig: Loading table properties from s3a://simian-customer-prod-output/stats/querying/.hoodie/hoodie.properties
   22/03/11 03:55:32 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3a://simian-customer-prod-output/stats/querying
   22/03/11 03:55:33 INFO HoodieActiveTimeline: Loaded instants upto : Option{val=[20220310060713509__rollback__COMPLETED]}
   22/03/11 03:55:33 INFO DFSPathSelector: Using path selector org.apache.hudi.utilities.sources.helpers.DFSPathSelector
   22/03/11 03:55:33 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from s3a://simian-customer-prod-output/stats/querying
   22/03/11 03:55:33 INFO HoodieTableConfig: Loading table properties from s3a://simian-customer-prod-output/stats/querying/.hoodie/hoodie.properties
   22/03/11 03:55:33 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3a://simian-customer-prod-output/stats/querying
   22/03/11 03:55:33 INFO HoodieActiveTimeline: Loaded instants upto : Option{val=[20220310060713509__rollback__COMPLETED]}
   22/03/11 03:55:33 INFO DeltaSync: Checkpoint to resume from : Option{val=1646891776000}
   22/03/11 03:55:33 INFO DFSPathSelector: Root path => s3a://simian-customer-prod-output/stats/ingesting source limit => 2147483648
   ```
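
   As an additional data point, a rough diagnostic sketch (placeholder bucket/prefix; it assumes the checkpoint value 1646891776000 shown above is an epoch-millisecond modification-time cutoff, which is how we understand DFSPathSelector to use it) that counts how many source files S3 reports as newer than that checkpoint:

   ```python
   # Hypothetical diagnostic: count S3 objects under the source prefix whose
   # LastModified is newer than the deltastreamer checkpoint (epoch millis).
   from datetime import datetime, timezone
   import boto3

   BUCKET = "example-bucket"      # placeholder
   PREFIX = "stats/ingesting/"    # placeholder
   CHECKPOINT_MS = 1646891776000  # "Checkpoint to resume from" in the logs above

   cutoff = datetime.fromtimestamp(CHECKPOINT_MS / 1000, tz=timezone.utc)

   s3 = boto3.client("s3")
   newer = total = 0
   for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=PREFIX):
       for obj in page.get("Contents", []):
           total += 1
           if obj["LastModified"] > cutoff:
               newer += 1

   print(f"{newer} of {total} objects are newer than the checkpoint")
   ```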

