rishabhreply opened a new issue, #10559: URL: https://github.com/apache/hudi/issues/10559
**Describe the problem you faced**

This is a question rather than a problem report; I could not find an answer in the FAQs. Please let me know if this is not an acceptable place to ask.

I have data arriving in multiple files (say 10 files) for one table, and all of them have the same value in the partition column. My setup is a state machine with Glue parallelization enabled. Say I set batch size = 2 and concurrency = 5 in the state machine: it will trigger 5 parallel Glue job instances and give each instance 2 files to process. I am using the **insert_overwrite** Hudi operation.

Q1. In this setup, how will Hudi behave, given that not all Glue job instances will finish at the same time? Will I see Hudi errors, or will later instances overwrite the data written by the instances that finished earlier?

**Environment Description**

* Hudi version :
* Spark version :
* Hive version :
* Hadoop version :
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) :

**Additional context**

Add any other context about the problem here.

**Stacktrace**

```Add the stacktrace of the error.```
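For context on the multi-writer question, here is a minimal sketch (not a confirmed answer) of the Hudi write options that enable optimistic concurrency control, which Hudi requires when multiple writers target the same table. All concrete names below (table name, record key, lock table, region) are placeholder assumptions for illustration, not values from this issue.

```python
# Hypothetical sketch: Hudi writer options for multiple concurrent
# insert_overwrite writers on S3, using optimistic concurrency control
# with a DynamoDB-based lock provider.
hudi_options = {
    # Basic table settings (table/field names are placeholders).
    "hoodie.table.name": "my_table",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.partitionpath.field": "partition_column",
    "hoodie.datasource.write.operation": "insert_overwrite",
    # Multi-writer safety: OCC plus an external lock provider; without
    # these, Hudi assumes a single writer per table.
    "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
    "hoodie.cleaner.policy.failed.writes": "LAZY",
    "hoodie.write.lock.provider":
        "org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider",
    "hoodie.write.lock.dynamodb.table": "hudi-locks",        # placeholder
    "hoodie.write.lock.dynamodb.partition_key": "my_table",  # placeholder
    "hoodie.write.lock.dynamodb.region": "us-east-1",        # placeholder
}

# Each Glue job instance would then write its batch with something like:
# df.write.format("hudi").options(**hudi_options).mode("append").save(s3_path)
```

Note that with insert_overwrite, each commit replaces the file groups in the partitions it touches, so concurrent writers to the same partition can still replace each other's output even when OCC prevents corruption; the options above only make concurrent writing safe, not order-independent.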
