tatiana-rackspace opened a new issue, #8085: URL: https://github.com/apache/hudi/issues/8085
**_Tips before filing an issue_**

- Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)?
- Join the mailing list to engage in conversations and get faster support at dev-subscr...@hudi.apache.org.
- If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.

**Describe the problem you faced**

Can you please help us understand when a delta commit is triggered for MoR tables? What are the criteria? Is it based on the number of records or the number of seconds?

**To Reproduce**

We are running DeltaStreamer on EMR to ingest files from S3.

DeltaStreamer config:

```
spark-submit \
  --jars /usr/lib/hudi/hudi-utilities-bundle.jar,/usr/lib/hudi/hudi-aws-bundle.jar \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer /usr/lib/hudi/hudi-utilities-bundle.jar \
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
  --source-ordering-field ts \
  --target-base-path s3a://hudi-test-table/deltastreamer_test_npartitioned/ \
  --target-table deltastreamer_test_npartitioned \
  --enable-sync \
  --sync-tool-classes org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool \
  --table-type MERGE_ON_READ \
  --op UPSERT \
  --continuous \
  --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://hudi-test-s3-target/parquet/public/users_cdc_test/2023/03/01/ \
  --hoodie-conf hoodie.datasource.hive_sync.mode=hms \
  --hoodie-conf hoodie.datasource.hive_sync.use_jdbc=false \
  --hoodie-conf hoodie.datasource.hive_sync.database=hudideltastreamer \
  --hoodie-conf hoodie.datasource.hive_sync.table=deltastreamer_test_npartitioned \
  --hoodie-conf hoodie.datasource.write.recordkey.field=user_id \
  --hoodie-conf hoodie.datasource.write.partitionpath.field="" \
  --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.NonpartitionedKeyGenerator \
  --hoodie-conf hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.NonPartitionedExtractor
```

We are testing how it ingests 1000 small files (around 10 KB each) from S3 (80000 rows inserted into the table).

1. Test 1: the files are generated and placed in S3 first. DeltaStreamer starts after all files are there, ingests them, and produces a single delta commit.
2. Test 2: the same amount of data, 1000 small files (around 10 KB each). The files are generated and placed in S3 while DeltaStreamer is running at the same time in continuous mode.

   Timeline:
   - 12:00 DeltaStreamer started running in continuous mode
   - 12:03:48 Parquet files started arriving in S3 (they arrived from 12:03:48 to 12:04:53)
   - 12:03:48 DeltaStreamer started processing them
   - 12:04:53 the last file arrived in S3
   - 12:05:20 DeltaStreamer finished the data ingestion

   During this time, 3 delta commits were generated.

**Expected behavior**

Can you please help us understand why there is 1 delta commit in the first test and 3 delta commits in the second test with the same amount of input data? How is a delta commit triggered, and what are the criteria?

**Environment Description**

* Hudi version : 12.1
* Spark version : 3.3
* Hive version :
* Hadoop version :
* Storage (HDFS/S3/GCS..) :
* Running on Docker? (yes/no) :

**Additional context**

Add any other context about the problem here.

**Stacktrace**

```Add the stacktrace of the error.```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at: us...@infra.apache.org
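[Editor's sketch, not part of the original report.] One way to make sense of the observed 1-vs-3 commit counts: in continuous mode, DeltaStreamer loops over sync rounds, and each round that finds new source data writes one deltacommit, so the commit count tracks how file arrivals line up with sync rounds rather than a fixed record or time threshold. The toy model below illustrates that mechanism only; the round start times and arrival spread are invented for the example and are not Hudi internals.

```python
# Toy model: each sync round ingests every file that has arrived since the
# previous round and, if it read anything, writes exactly one deltacommit.

def count_deltacommits(arrival_times, round_starts):
    """Count commits: one per sync round that finds new input files."""
    commits = 0
    consumed = 0
    files = sorted(arrival_times)
    for t in round_starts:
        # Files that have landed by the time this round begins,
        # minus those already ingested by earlier rounds.
        available = sum(1 for a in files if a <= t) - consumed
        if available > 0:
            commits += 1
            consumed += available
    return commits

# Test 1: all 1000 files are already in S3 before the first round.
print(count_deltacommits([0.0] * 1000, round_starts=[10, 20, 30]))  # 1

# Test 2: files keep arriving over ~65 seconds while rounds run, so
# several rounds each pick up a fresh batch.
arrivals = [i * 0.065 for i in range(1000)]
print(count_deltacommits(arrivals, round_starts=[20, 45, 70]))  # 3
```

If this model is right, knobs that plausibly influence the cadence include HoodieDeltaStreamer's `--min-sync-interval-seconds` and `--source-limit` options; worth verifying against the Hudi documentation for your version.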