Re: Samza Join example

Yi Pan Thu, 14 Jan 2016 09:47:59 -0800

Hi, Stanislov,

That's awesome! It would be great to have this integrated w/ Samza
tutorial. Would you mind to create a tutorial page for the join job
implementation in Samza?


Thanks a lot!

-Yi

On Thu, Jan 14, 2016 at 7:28 AM, Stanislav Los <[email protected]>
wrote:

> If anyone interested, I did a quick PoC attempting to join two data sets
> using hello-samza as a starting point.
>
> Points to note, I did it in Scala.
> Our target was to keep at least 1 hour window of resent data at any given
> point in time, i.e ~200,000,000 records/h throughput for the first data set
> (ad auction bids), ~20,000,000/h for another another data set (ad
> impressions). That way, we're not constrained by order of events as much
> and data streams can be quite out of sync in case of replay from archive
> storage.
>
> You can find PoC that runs on local Samza grid here
> https://github.com/staslos/samza-hello-samza/tree/imp_bid_join, or pull
> request not for merging, but just to keep changes in one place
> https://github.com/apache/samza-hello-samza/pull/6. Can't brush it up for
> proper merge with master, since I'm being pulled to other task, but at
> least it's not lost and someone can find it useful.
> See src/main/scala/README for details.
>
> I have another branch that runs on CDH at scale, but I think it's overkill
> for current topic. Anyway, if you don't mind Magnetic specific stuff (no
> legal obligations), it's here
> https://github.com/staslos/samza-hello-samza/tree/imp_bid_join_cdh
>
> Overall we were very impressed with Samza performance, it took just 30
> containers (30 partitions on each Kafka topic) with default settings to do
> a reliable join on our Hadoop cluster. Just for the record, on Spark
> Streaming I was able to keep only a couple of minutes Bids window with lots
> of other constraints and workarounds.
> Samza is our way to go with large RT joins.
>
> Regards,
> Stan
>

Re: Samza Join example

Reply via email to