I am currently rewriting this page: http://samza.incubator.apache.org/learn/documentation/0.7.0/container/state-management.html which includes a few examples of use cases where stateful processing is needed. Most of the examples are ok, but there's one which I don't find credible: the stream/stream join.
The example given is: you have a stream of ad impressions and a stream of ad clicks, and you want to join each click with its corresponding impression so that you can calculate the click-through rate. Unfortunately the example suffers from a few flaws: - To calculate the CTR, you don't actually need to join individual events. You only need to count clicks and impressions (perhaps grouped by various dimensions in an OLAP cube). Clicks and impressions can be counted independently, so this is really an aggregation example, not a stream join example. - You could argue that you need to join individual events because you want to include attributes of the impression in the analysis of the clicks (e.g. timestamp of impression). However, such attributes of the impression can be directly included in the click event (whatever tracks the click can remember the attributes of the impression, e.g. encoded in an URL). That would be much simpler than trying to join the streams after the fact. Could someone enlighten me why the join of ad clicks and ad impressions is necessary? Or if not, does someone have a compelling and easy-to-understand example of stream/stream joins that I could include in the docs? I'm struggling to think of one myself, even though it must exist... Thanks, Martin
