Just following up: please let me know if any of you have recommendations for implementing the use case described below.
Thanks.

On Tuesday, October 29, 2024 at 10:37:30 PM PDT, Anil Dasari <dasaria...@myyahoo.com> wrote:

Hi Venkat,

Thanks for the reply. Micro-batching is a data processing technique in which small batches of data are collected and processed together at regular intervals. However, I'm aiming to avoid traditional micro-batch processing by tagging the records within a time window as a batch, allowing for near-real-time data processing.

I'm currently exploring Flink for the following use case:
1. Group data by a time window and write it to S3 under the appropriate prefix.
2. Generate metrics for the micro-batch and, if needed, store them in S3.
3. Send the metrics to an external system to signal that Step 1 has completed.

If any part of the process fails, the entire micro-batch should be rolled back. I'm planning to implement a two-phase commit sink for Steps 2 and 3. The primary challenge is tagging the record set with an epoch time across all tasks within a window, so that the sink can use it to create committable splits, similar to the processing time in the Flink file sink.

Thanks

On Tuesday, October 29, 2024 at 09:40:40 PM PDT, Venkatakrishnan Sowrirajan <vsowr...@asu.edu> wrote:

Can you share more details on what you mean by micro-batching? Can you explain it with an example so we can understand it better?

Thanks
Venkat

On Tue, Oct 29, 2024, 1:22 PM Anil Dasari <dasaria...@myyahoo.com.invalid> wrote:

> Hello team,
>
> I apologize for reaching out on the dev mailing list. I'm working on
> implementing micro-batching with near-real-time processing.
> I've seen similar questions in the Flink Slack channel and on the user
> mailing list, but there hasn't been much discussion or feedback. Here are
> the options I've explored:
>
> 1. Windowing: This approach looked promising, but the flushing mechanism
> requires record-level information checks, since window data isn't
> accessible throughout the pipeline.
> 2. Window + Trigger: This method buffers events until the trigger interval
> is reached, which affects real-time processing; events are only processed
> when the trigger fires.
> 3. Processing time: The processing time is specific to each file writer,
> resulting in inconsistencies across different task managers.
> 4. Watermark: There is no global watermark; it's specific to each source
> task, and the initial watermark (before the first watermark event) isn't
> epoch-based.
>
> I'm looking to write data grouped by time (micro-batch time). What's the
> best approach to achieve micro-batching in Flink?
>
> Let me know if you have any questions. Thanks.
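The core difficulty described in this thread is producing a batch tag that is identical on every task, which processing time cannot guarantee. One common workaround is to derive the tag by flooring each record's event timestamp to an epoch-aligned window boundary: because the tag is a pure function of the timestamp and the window size, every task computes the same tag for records of the same window, with no coordination. Below is a minimal sketch in plain Java, with no Flink dependencies; the class name `MicroBatchTagger` and its methods are hypothetical, not part of any Flink API.

```java
import java.time.Instant;

public class MicroBatchTagger {
    private final long windowSizeMs;

    public MicroBatchTagger(long windowSizeMs) {
        this.windowSizeMs = windowSizeMs;
    }

    /**
     * Epoch-aligned start of the window containing the given event time.
     * Deterministic: any task computes the same tag for the same timestamp,
     * unlike processing time, which varies per task manager.
     */
    public long batchTag(long eventTimeMs) {
        return Math.floorDiv(eventTimeMs, windowSizeMs) * windowSizeMs;
    }

    /** An S3-style prefix for the micro-batch, derived from the tag. */
    public String batchPrefix(long eventTimeMs) {
        return "batch=" + Instant.ofEpochMilli(batchTag(eventTimeMs));
    }

    public static void main(String[] args) {
        MicroBatchTagger tagger = new MicroBatchTagger(60_000); // 1-minute windows
        long t1 = 1730264130123L; // two events inside the same minute
        long t2 = 1730264159999L;
        long t3 = 1730264160000L; // first millisecond of the next minute
        System.out.println(tagger.batchTag(t1) == tagger.batchTag(t2)); // same batch
        System.out.println(tagger.batchTag(t2) == tagger.batchTag(t3)); // different batch
        System.out.println(tagger.batchPrefix(t1));
    }
}
```

In a Flink pipeline this computation would typically live in a map or process function that stamps each record before the sink, so the sink can group its committables by tag; whether that is enough for an atomic cross-sink rollback across S3 writes and external notifications is exactly the open question of this thread.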