[
https://issues.apache.org/jira/browse/SAMZA-859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15106203#comment-15106203
]
Yi Pan (Data Infrastructure) commented on SAMZA-859:
----------------------------------------------------
[~staslos] has already started a nice example. We should finish it. Below is
the mail list conversation on this join example:
{noformat}
If anyone interested, I did a quick PoC attempting to join two data sets
using hello-samza as a starting point.
Points to note, I did it in Scala.
Our target was to keep at least 1 hour window of resent data at any given
point in time, i.e ~200,000,000 records/h throughput for the first data set
(ad auction bids), ~20,000,000/h for another another data set (ad
impressions). That way, we're not constrained by order of events as much
and data streams can be quite out of sync in case of replay from archive
storage.
You can find PoC that runs on local Samza grid here
https://github.com/staslos/samza-hello-samza/tree/imp_bid_join, or pull
request not for merging, but just to keep changes in one place
https://github.com/apache/samza-hello-samza/pull/6. Can't brush it up for
proper merge with master, since I'm being pulled to other task, but at
least it's not lost and someone can find it useful.
See src/main/scala/README for details.
I have another branch that runs on CDH at scale, but I think it's overkill
for current topic. Anyway, if you don't mind Magnetic specific stuff (no
legal obligations), it's here
https://github.com/staslos/samza-hello-samza/tree/imp_bid_join_cdh
Overall we were very impressed with Samza performance, it took just 30
containers (30 partitions on each Kafka topic) with default settings to do
a reliable join on our Hadoop cluster. Just for the record, on Spark
Streaming I was able to keep only a couple of minutes Bids window with lots
of other constraints and workarounds.
Samza is our way to go with large RT joins.
Regards,
Stan
{noformat}
> Create a simple join example in hello-samza tutorial
> ----------------------------------------------------
>
> Key: SAMZA-859
> URL: https://issues.apache.org/jira/browse/SAMZA-859
> Project: Samza
> Issue Type: Bug
> Reporter: Yi Pan (Data Infrastructure)
>
> It is very useful and a common question from the dev list as for "how do I do
> join in Samza". We even have some todos in the source code to complete it. We
> should push forward to make this an official use case in hello-samza and the
> tutorial page.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)