tatiana-rackspace opened a new issue, #8085:
URL: https://github.com/apache/hudi/issues/8085

   **Describe the problem you faced**
   
   Can you please help us understand when a delta commit is triggered for MoR tables, and what the criteria are? Is it driven by the number of records or by elapsed time in seconds?
   
   **To Reproduce**
   We are running DeltaStreamer on EMR to ingest Parquet files from S3.
   
   DeltaStreamer config:
   ```
   spark-submit \
   --jars /usr/lib/hudi/hudi-utilities-bundle.jar,/usr/lib/hudi/hudi-aws-bundle.jar \
   --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer /usr/lib/hudi/hudi-utilities-bundle.jar \
   --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
   --source-ordering-field ts \
   --target-base-path s3a://hudi-test-table/deltastreamer_test_npartitioned/ \
   --target-table deltastreamer_test_npartitioned \
   --enable-sync \
   --sync-tool-classes org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool \
   --table-type MERGE_ON_READ \
   --op UPSERT \
   --continuous \
   --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://hudi-test-s3-target/parquet/public/users_cdc_test/2023/03/01/ \
   --hoodie-conf hoodie.datasource.hive_sync.mode=hms \
   --hoodie-conf hoodie.datasource.hive_sync.use_jdbc=false \
   --hoodie-conf hoodie.datasource.hive_sync.database=hudideltastreamer \
   --hoodie-conf hoodie.datasource.hive_sync.table=deltastreamer_test_npartitioned \
   --hoodie-conf hoodie.datasource.write.recordkey.field=user_id \
   --hoodie-conf hoodie.datasource.write.partitionpath.field="" \
   --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.NonpartitionedKeyGenerator \
   --hoodie-conf hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.NonPartitionedExtractor
   ```
   
   
   We are testing how it ingests 1000 small files (around 10 KB each) from S3 (80000 rows inserted into the table).
   Test 1:
   All files are generated and placed in S3 first.
   DeltaStreamer starts after all files are there, ingests them, and produces a single delta commit.
   
   Test 2:
   The same amount of data: 1000 small files (around 10 KB each).
   Files are generated and placed in S3 while DeltaStreamer is already running in continuous mode.
   Timeline:
   12:00     DeltaStreamer started running in continuous mode
   12:03:48  Parquet files started arriving in S3 (arrivals ran from 12:03:48 to 12:04:53)
   12:03:48  DeltaStreamer started processing them
   12:04:53  Last file arrived in S3
   12:05:20  DeltaStreamer finished data ingestion
   
   During this time, 3 delta commits were generated.
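
For reference, the delta commit counts above can be checked by listing the table's timeline directory. The following is a minimal sketch, assuming that completed delta commits appear as `<instant>.deltacommit` files under `.hoodie/` in the table base path (our understanding of the timeline layout for this Hudi version) and that the AWS CLI is available. `count_deltacommits` is a hypothetical helper name, not part of Hudi.

```shell
# Count completed delta commits in a listing of the .hoodie/ directory.
# Reads the listing from stdin, one entry per line.
# Entries ending in .deltacommit.requested or .deltacommit.inflight are
# not counted, because the pattern is anchored at the end of the line.
count_deltacommits() {
  grep -cE '[0-9]+\.deltacommit$'
}

# Example usage (assumes AWS CLI credentials with access to the bucket):
#   aws s3 ls s3://hudi-test-table/deltastreamer_test_npartitioned/.hoodie/ | count_deltacommits
```

Run once after each test, this should print 1 for the first test and 3 for the second if the assumption about the file naming holds.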
   
   **Expected behavior**
   
   Can you please help us understand why there is 1 delta commit in the first test and 3 delta commits in the second test, given the same amount of input data? How is a delta commit triggered, and what are the criteria?
   
   **Environment Description**
   
   * Hudi version : 12.1
   
   * Spark version : 3.3
   
   * Hive version : 
   
   * Hadoop version :
   
   * Storage (HDFS/S3/GCS..) :
   
   * Running on Docker? (yes/no) :
   
   


