Hello. My team is evaluating Samza for real-time processing. We would like to run a directed graph of jobs, where the records in each job's input streams are joined on a common key. We already have functionality that performs the join by buffering records from the input streams until certain conditions are met and then passing the joined records on.
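To make the join logic concrete, here is a rough, simplified sketch of the buffering we have in mind. The names and the "every expected stream has contributed a record" release condition are placeholders; our real conditions are more involved:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;

    // Simplified stand-in for our join logic: buffer records per key and release
    // them once every expected input stream has contributed one record.
    public class JoinBuffer {
      private final Set<String> expectedStreams;
      private final Map<Object, Map<String, Object>> pendingByKey = new HashMap<>();

      public JoinBuffer(Set<String> expectedStreams) {
        this.expectedStreams = expectedStreams;
      }

      // Returns the joined records for the key once the release condition is met,
      // or null while we are still waiting on records from other streams.
      public Map<String, Object> add(Object key, String stream, Object record) {
        Map<String, Object> pending = pendingByKey.computeIfAbsent(key, k -> new HashMap<>());
        pending.put(stream, record);
        return pending.keySet().containsAll(expectedStreams) ? pendingByKey.remove(key) : null;
      }
    }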
We are wondering about the best way to integrate this functionality into a Samza job. After looking over the API, we see two possibilities:

1. Use a StreamTask that adds records to a buffer. This is the approach the "ad event" example uses. But we are concerned that the framework commits a StreamTask's offsets after process() completes, so if the job fails, any records still sitting in the buffer are permanently lost.

2. Use an AsyncStreamTask that adds records to a buffer and also holds on to their TaskCallbacks. When records are eventually joined and processed, complete their callbacks (rough sketch at the bottom of this mail). This approach seems promising, but it requires setting task.max.concurrency very high - possibly in the tens of thousands in our case.

Are we likely to run into any issues doing that? Are there any other options that we have overlooked? What is the best approach?

-Jef G
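P.S. For concreteness, here is roughly what we have in mind for option 2. It is only a sketch under our own assumptions: it reuses the JoinBuffer sketched above, and the stream names, the output SystemStream, and the one-callback-per-buffered-record bookkeeping are illustrative placeholders, not our real code.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;

    import org.apache.samza.system.IncomingMessageEnvelope;
    import org.apache.samza.system.OutgoingMessageEnvelope;
    import org.apache.samza.system.SystemStream;
    import org.apache.samza.task.AsyncStreamTask;
    import org.apache.samza.task.MessageCollector;
    import org.apache.samza.task.TaskCallback;
    import org.apache.samza.task.TaskCoordinator;

    public class AsyncJoinTask implements AsyncStreamTask {
      // Illustrative output stream and input stream names; our real job has more inputs.
      private static final SystemStream OUTPUT = new SystemStream("kafka", "joined-output");
      private final JoinBuffer buffer =
          new JoinBuffer(new HashSet<>(Arrays.asList("impressions", "clicks")));
      // One outstanding callback per buffered record, grouped by join key.
      private final Map<Object, List<TaskCallback>> pendingCallbacks = new HashMap<>();

      @Override
      public void processAsync(IncomingMessageEnvelope envelope, MessageCollector collector,
          TaskCoordinator coordinator, TaskCallback callback) {
        Object key = envelope.getKey();
        String stream = envelope.getSystemStreamPartition().getStream();

        pendingCallbacks.computeIfAbsent(key, k -> new ArrayList<>()).add(callback);
        Map<String, Object> joined = buffer.add(key, stream, envelope.getMessage());

        if (joined != null) {
          collector.send(new OutgoingMessageEnvelope(OUTPUT, key, joined));
          // Only now do we tell the framework these records are done, so their
          // offsets are not checkpointed while the records live only in memory.
          for (TaskCallback pending : pendingCallbacks.remove(key)) {
            pending.complete();
          }
        }
        // Every record that is still buffered holds an uncompleted callback,
        // which is why we expect to need a very large task.max.concurrency.
      }
    }

The idea is that a record's callback is completed only after the record has been joined and emitted, so a failure before that point should cause the record to be redelivered rather than lost.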