Hi Martin,

For me the value in joining the impression and click streams lies more in the
last point you mentioned, enrichment, than in CTR counting. Note though that
(in my case at least) the aim isn't to enrich the click event with data from
the impression but the reverse.

My ad servers generate dozens of discrete event streams, most of which are
closely tied to internal workings but can usually be associated with an
impression event. For me the value in joining these streams is to create much
richer composite events that can then be pushed to other tasks. Those tasks
can do things such as CTR counting (aggregation, as you say), but they can
also do near-real-time processing on the composite records. The latter is
particularly useful for health/alert type behaviours.

The various related events are usually close together in time, but things
like clicks can trail behind, which is where the stateful aspects are useful.
A strategy for dealing with stragglers is always needed, though.
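
To make the stateful part concrete, here is a minimal Python sketch of that
kind of join. It is only an illustration under assumptions: the names
(CompositeJoiner, window_seconds, the stream labels) are made up, the plain
dict stands in for whatever durable key-value store a real job would use, and
none of this is Samza's API. Events are merged per impression ID, and a
composite is emitted once its window has closed.

```python
from dataclasses import dataclass, field


@dataclass
class Composite:
    """One impression ID plus every related event seen for it so far."""
    impression_id: str
    first_seen: float                            # timestamp of the first related event
    events: dict = field(default_factory=dict)   # stream name -> payload


class CompositeJoiner:
    """Hypothetical joiner: groups related events by impression ID."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        # Assumption: a plain dict stands in for a durable key-value store.
        self.store = {}

    def process(self, stream, impression_id, timestamp, payload):
        """Merge one incoming event into the composite for its impression."""
        composite = self.store.get(impression_id)
        if composite is None:
            composite = Composite(impression_id, timestamp)
            self.store[impression_id] = composite
        composite.events[stream] = payload

    def flush(self, now):
        """Emit and forget every composite whose window has closed."""
        expired = [c for c in self.store.values()
                   if now - c.first_seen >= self.window_seconds]
        for c in expired:
            del self.store[c.impression_id]
        return expired


joiner = CompositeJoiner(window_seconds=60)
joiner.process("impression", "imp-1", 0.0, {"ad": "a42"})
joiner.process("click", "imp-1", 12.5, {"x": 10})
emitted = joiner.flush(now=61.0)        # imp-1 is emitted with both events

# A straggler arriving after the flush starts a new composite that has no
# "impression" entry, so a downstream task can spot it and handle it apart.
joiner.process("click", "imp-1", 200.0, {"x": 11})
late = joiner.flush(now=261.0)
```

In this sketch the straggler strategy is simply "emit whatever arrived late as
an incomplete composite"; a longer window or a separate late-event stream
would be equally reasonable choices.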

I appreciate this may be very industry specific. Perhaps the useful example
to draw here is a generalized one: grouping related system events into a
richer composite record in a way that is both more resilient and not
dependent on in-memory buffering?

Regards
Garry

-----Original Message-----
From: Martin Kleppmann [mailto:[email protected]] 
Sent: 31 May 2014 19:19
To: Samza dev list
Subject: Example use cases for stream/stream join

I am currently rewriting this page: 
http://samza.incubator.apache.org/learn/documentation/0.7.0/container/state-management.html
which includes a few examples of use cases where stateful processing is needed. 
Most of the examples are ok, but there's one which I don't find credible: the 
stream/stream join.

The example given is: you have a stream of ad impressions and a stream of ad 
clicks, and you want to join each click with its corresponding impression so 
that you can calculate the click-through rate. Unfortunately the example 
suffers from a few flaws:

- To calculate the CTR, you don't actually need to join individual events. You 
only need to count clicks and impressions (perhaps grouped by various 
dimensions in an OLAP cube). Clicks and impressions can be counted 
independently, so this is really an aggregation example, not a stream join 
example.

- You could argue that you need to join individual events because you want to 
include attributes of the impression in the analysis of the clicks (e.g. 
timestamp of impression). However, such attributes of the impression can be 
directly included in the click event (whatever tracks the click can remember 
the attributes of the impression, e.g. encoded in a URL). That would be much 
simpler than trying to join the streams after the fact.

Could someone enlighten me why the join of ad clicks and ad impressions is 
necessary? Or if not, does someone have a compelling and easy-to-understand 
example of stream/stream joins that I could include in the docs? I'm struggling 
to think of one myself, even though it must exist...

Thanks,
Martin

