Hi, Stanislav,

That's awesome! It would be great to have this integrated with the Samza tutorial. Would you mind creating a tutorial page for the join job implementation in Samza?
Thanks a lot!

-Yi

On Thu, Jan 14, 2016 at 7:28 AM, Stanislav Los <[email protected]> wrote:

> If anyone is interested, I did a quick PoC attempting to join two data sets
> using hello-samza as a starting point.
>
> Points to note: I did it in Scala.
> Our target was to keep at least a 1 hour window of recent data at any given
> point in time, i.e. ~200,000,000 records/h throughput for the first data set
> (ad auction bids), ~20,000,000/h for the other data set (ad
> impressions). That way, we're not constrained by order of events as much
> and data streams can be quite out of sync in case of replay from archive
> storage.
>
> You can find the PoC that runs on a local Samza grid here
> https://github.com/staslos/samza-hello-samza/tree/imp_bid_join, or as a pull
> request, not for merging, but just to keep the changes in one place
> https://github.com/apache/samza-hello-samza/pull/6. I can't brush it up for
> a proper merge with master, since I'm being pulled to another task, but at
> least it's not lost and someone may find it useful.
> See src/main/scala/README for details.
>
> I have another branch that runs on CDH at scale, but I think it's overkill
> for the current topic. Anyway, if you don't mind Magnetic specific stuff (no
> legal obligations), it's here
> https://github.com/staslos/samza-hello-samza/tree/imp_bid_join_cdh
>
> Overall we were very impressed with Samza performance: it took just 30
> containers (30 partitions on each Kafka topic) with default settings to do
> a reliable join on our Hadoop cluster. Just for the record, on Spark
> Streaming I was able to keep only a couple of minutes of Bids window, with
> lots of other constraints and workarounds.
> Samza is our way to go with large RT joins.
>
> Regards,
> Stan
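
For readers skimming the thread, the windowed join described above can be sketched roughly as follows. This is only a minimal in-memory illustration in plain Scala, not the code from the PoC and not the Samza API: the names `Bid`, `Impression`, `Joined`, and `WindowedJoin` are hypothetical, and the mutable maps stand in for whatever state store the real job uses. Each side is buffered by key, so a match is found whichever record arrives first, and entries older than the window are dropped.

```scala
import scala.collection.mutable

// Hypothetical record types; field names are illustrative, not taken from the PoC.
final case class Bid(auctionId: String, timestampMs: Long, price: Double)
final case class Impression(auctionId: String, timestampMs: Long)
final case class Joined(bid: Bid, impression: Impression)

// Buffers both sides for up to `windowMs` so arrival order does not matter.
// In-memory maps are an assumption here; a real job would use a durable store.
final class WindowedJoin(windowMs: Long) {
  private val bids = mutable.Map.empty[String, Bid]
  private val impressions = mutable.Map.empty[String, Impression]

  // A bid either matches an impression that arrived earlier, or is buffered.
  def onBid(bid: Bid): Option[Joined] =
    impressions.remove(bid.auctionId) match {
      case Some(imp) => Some(Joined(bid, imp))
      case None =>
        bids.update(bid.auctionId, bid)
        None
    }

  // An impression either matches a buffered bid, or is buffered itself.
  def onImpression(imp: Impression): Option[Joined] =
    bids.remove(imp.auctionId) match {
      case Some(bid) => Some(Joined(bid, imp))
      case None =>
        impressions.update(imp.auctionId, imp)
        None
    }

  // Drop entries older than the window, as a periodic cleanup would.
  def expire(nowMs: Long): Unit = {
    bids --= bids.collect { case (k, b) if nowMs - b.timestampMs > windowMs => k }
    impressions --= impressions.collect { case (k, i) if nowMs - i.timestampMs > windowMs => k }
  }

  def bufferedBids: Int = bids.size
  def bufferedImpressions: Int = impressions.size
}
```

With a one hour window (`new WindowedJoin(60L * 60 * 1000)`), an impression that arrives before its bid is simply held until the bid shows up or the window expires, which is the property that makes replay from archive storage tolerable.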
