If anyone's interested, I did a quick PoC that joins two data sets,
using hello-samza as a starting point.

A few points to note: I did it in Scala.
Our target was to keep at least a 1-hour window of recent data at any given
point in time, i.e. ~200,000,000 records/h throughput for the first data set
(ad auction bids) and ~20,000,000/h for the second data set (ad
impressions). That way we're not constrained by the order of events as
much, and the data streams can be quite out of sync when replaying from
archive storage.
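To make the idea concrete, here is a minimal sketch of that windowed join logic in plain Scala, with no Samza dependencies. In the actual PoC the buffered bids would live in Samza's local key-value store rather than an in-memory map, and the names `Bid`, `Impression`, and `auctionId` are illustrative assumptions, not the real schema:

```scala
import scala.collection.mutable

// Illustrative event types; the real data sets have richer schemas.
case class Bid(auctionId: String, ts: Long, price: Double)
case class Impression(auctionId: String, ts: Long)

// Buffers bids for up to `windowMs` and joins them with impressions by
// auction id. Because matching is keyed rather than order-based, the two
// streams may arrive out of sync (e.g. during replay from archive storage).
class WindowedJoiner(windowMs: Long) {
  private val bids = mutable.Map.empty[String, Bid]

  // Buffer a bid, keyed by auction id.
  def onBid(b: Bid): Unit = bids(b.auctionId) = b

  // Join an impression with a buffered bid, if one exists within the window.
  def onImpression(i: Impression): Option[(Bid, Impression)] =
    bids.remove(i.auctionId)
      .filter(b => math.abs(i.ts - b.ts) <= windowMs)
      .map(b => (b, i))

  // Drop bids older than the window, relative to `now` (event time).
  def evict(now: Long): Unit = {
    val expired = bids.collect { case (k, b) if now - b.ts > windowMs => k }
    expired.foreach(bids.remove)
  }
}

object Demo extends App {
  val j = new WindowedJoiner(windowMs = 3600 * 1000L)
  j.onBid(Bid("a1", ts = 0L, price = 1.25))
  println(j.onImpression(Impression("a1", ts = 60 * 1000L))) // joined within window
  println(j.onImpression(Impression("a2", ts = 60 * 1000L))) // no matching bid
}
```

In the Samza version, `onBid`/`onImpression` correspond to the task's `process()` callback and `evict` to the periodic `window()` callback, with the bid buffer backed by state that Kafka can restore on failure.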

You can find the PoC that runs on a local Samza grid here:
https://github.com/staslos/samza-hello-samza/tree/imp_bid_join, or as a pull
request (not for merging, just to keep the changes in one place):
https://github.com/apache/samza-hello-samza/pull/6. I can't brush it up for
a proper merge with master, since I'm being pulled onto another task, but at
least it's not lost and someone may find it useful.
See src/main/scala/README for details.

I have another branch that runs on CDH at scale, but I think it's overkill
for the current topic. Anyway, if you don't mind Magnetic-specific stuff (no
legal obligations), it's here:
https://github.com/staslos/samza-hello-samza/tree/imp_bid_join_cdh

Overall we were very impressed with Samza's performance: it took just 30
containers (30 partitions on each Kafka topic) with default settings to do
a reliable join on our Hadoop cluster. For the record, with Spark
Streaming I was able to keep only a couple of minutes of bids in the
window, with lots of other constraints and workarounds.
Samza is our way to go for large real-time joins.

Regards,
Stan
