If anyone is interested, I did a quick PoC that joins two data sets, using hello-samza as a starting point.

A few points to note: I did it in Scala. Our target was to keep at least a one-hour window of recent data at any given point in time, i.e. a throughput of ~200,000,000 records/hour for the first data set (ad auction bids) and ~20,000,000 records/hour for the second (ad impressions). With a window that large we are much less constrained by the order of events, and the data streams can be quite out of sync when replaying from archive storage.
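In case it helps anyone reading along, the core of such a join is quite small. Here is a minimal sketch of the idea, not the actual PoC code: the stream, store and topic names ("bids", "impressions", "bid-store", "bid-impression-joins") are made up, and the pass that evicts entries older than one hour is left out:

    import org.apache.samza.config.Config
    import org.apache.samza.storage.kv.KeyValueStore
    import org.apache.samza.system.{IncomingMessageEnvelope, OutgoingMessageEnvelope, SystemStream}
    import org.apache.samza.task._

    class BidImpressionJoinTask extends StreamTask with InitableTask {
      // Local buffer for bids awaiting their impression; declared in the
      // job config under a hypothetical store name "bid-store".
      private var bids: KeyValueStore[String, String] = _
      private val output = new SystemStream("kafka", "bid-impression-joins")

      override def init(config: Config, context: TaskContext): Unit = {
        bids = context.getStore("bid-store").asInstanceOf[KeyValueStore[String, String]]
      }

      override def process(envelope: IncomingMessageEnvelope,
                           collector: MessageCollector,
                           coordinator: TaskCoordinator): Unit = {
        // Both topics are keyed by auction id, so matching events land in
        // the same task instance.
        val auctionId = envelope.getKey.asInstanceOf[String]
        envelope.getSystemStreamPartition.getStream match {
          case "bids" =>
            // Buffer the bid; a WindowableTask.window() pass (not shown)
            // would evict entries older than the one-hour window.
            bids.put(auctionId, envelope.getMessage.asInstanceOf[String])
          case "impressions" =>
            // Emit the joined record when the matching bid is present.
            Option(bids.get(auctionId)) foreach { bid =>
              collector.send(new OutgoingMessageEnvelope(
                output, auctionId, s"$bid|${envelope.getMessage}"))
              bids.delete(auctionId)
            }
        }
      }
    }

You would typically declare the store as RocksDB-backed with a Kafka changelog in the job config, so the buffered hour of bids survives container restarts. The full version, serdes and config included, is in the branches below.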
You can find the PoC, which runs on a local Samza grid, here: https://github.com/staslos/samza-hello-samza/tree/imp_bid_join, or as a pull request (not meant for merging, just to keep the changes in one place): https://github.com/apache/samza-hello-samza/pull/6. I can't brush it up for a proper merge into master, since I'm being pulled onto another task, but at least it isn't lost and someone may find it useful. See src/main/scala/README for details.

I have another branch that runs on CDH at scale, but I think that's overkill for the current topic. Anyway, if you don't mind the Magnetic-specific stuff (no legal obligations), it's here: https://github.com/staslos/samza-hello-samza/tree/imp_bid_join_cdh

Overall we were very impressed with Samza's performance: it took just 30 containers (with 30 partitions on each Kafka topic) and default settings to do a reliable join on our Hadoop cluster. Just for the record, with Spark Streaming I was able to keep only a couple of minutes of the bids window, and that with lots of other constraints and workarounds. Samza is our way to go for large real-time joins.

Regards,
Stan