rishabhreply commented on issue #10559: URL: https://github.com/apache/hudi/issues/10559#issuecomment-1921742508
Sorry about that. Let me try to rephrase it. In S3 I have 10 files. I have a state machine consisting of one Glue job with Hudi parameters set; in particular, the partition_key will be the same for all the files. The state machine has its batch size set to 2 and max concurrency set to 5. In case you are not aware, this means the state machine will create 5 batches of 2 files each and distribute them to 5 Glue job instances. So 5 instances of the Glue job are reading and will write to the same destination under the same partition. My question is: will there be a discrepancy in the data written to the target because of this parallelization in this setting?
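
To make the batching arithmetic concrete, here is a minimal sketch in plain Python. It does not call Step Functions, Glue, or Hudi; the bucket path, file names, and the `make_batches` helper are hypothetical and only illustrate how 10 files with a batch size of 2 and max concurrency of 5 end up as 5 concurrent writers.

```python
from typing import List


def make_batches(items: List[str], batch_size: int) -> List[List[str]]:
    """Split items into consecutive chunks of at most batch_size (hypothetical helper)."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# Hypothetical S3 keys standing in for the 10 input files.
files = [f"s3://my-bucket/input/file_{n}.csv" for n in range(10)]

batch_size = 2        # batch size configured on the state machine
max_concurrency = 5   # max concurrency configured on the state machine

batches = make_batches(files, batch_size)

# At most max_concurrency batches are processed at once, one job instance per batch.
concurrent_jobs = min(len(batches), max_concurrency)

print(f"{len(batches)} batches, {concurrent_jobs} job instances running at once")
for index, batch in enumerate(batches):
    # Every instance targets the same table and the same partition.
    print(f"job instance {index}: {batch}")
```

With these numbers, all 5 batches run at the same time, so 5 writers hit the same Hudi partition concurrently, which is the situation I am asking about.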
