Hey Alex, As I understand it, the CEP pattern you describing is, "look for a series of events within some bounded time frame, and take an action based on the combination of events." You use an example of three events arriving within 10 minutes of each other, consecutively. Wikipedia uses a similar example (wedding bell event + man in suit event + woman in white dress event + rice thrown event = wedding) on their CEP page.
This pattern can be implemented in Samza fairly easily using Samza's key/value store (or some other StorageEngine, if you choose to implement it). It's best to use a key/value store for this use case, since the window might be quite long (10 minutes), and all events in the window might not fit in memory. If you use Samza's key/value store, you can put each message (and a timestamp) into the key/value store as the messages arrive. You can then implement the WindowableTask interface along with the StreamTask interface, and configure Samza to call window() on your task every N seconds (say, task.window.ms=60000). The window method could then do a range query on the key/value store, and check for message chains (e.g. E1 -> E2 -> E3) that were last updated > 10 minutes ago. If an expected message was missing, you could then take some action (send an alert, or whatever). In general, when I think CEP, I think Esper (http://esper.codehaus.org/). You should be able to implement a lot of CEP/SQL type commands (SELECT, JOIN, COUNT, SUM, DISTINCT, WHERE, GROUP BY, HAVING, WINDOW, ORDER, etc) using Samza's StreamTask interface, and is state management facilities. Beyond state management, most features in Samza enable CEP processing, in one way or another. From your perspective, you can look at Samza as the underlying framework with which you might choose to implement a CEP type system (think MapReduce is to Hive as Samza is to a CEP system). Specific things that help are its WindowableTask interface, the partitioning model (which lends itself to distributed joins and aggregation), and Samza's state management features. One thing to be aware of right now is Samza's "at least once" messaging guarantee when failures occur (inherited from Kafka). You might receive duplicate messages. This means you can potentially double count, if you're doing aggregation. In the example you give (E1, E2, E3), this shouldn¹t be a problem. We have plans to provide exactly once messaging, but we haven't implemented the feature yet. Cheers, Chris On 8/24/13 12:05 PM, "Alex The Rocker" <[email protected]> wrote: >Hello, > >I just began to read about Samza, and I very excited about it (I was >warned >of its existence by Jay Kreps' post in Kafka users list, BTW). > >My first reaction is: are you guys using it at LinkedIn for applications >which lies in the CEP (Complex Event Processing) system domain? > >To be more specific, would stateful Samza tasks be used in order to >compute >complex states such as "event E1 is followed by E2 then by E3 with less >than 10 minutes interval between each event" ? > >I was looking at Storm for CEP, but as pointed out in Samza Storm page, >Storm leaves state management to the bolts code, whereas Samza has >"something". > >Beyond state management, what else would make Samza a good building block >for a CEP? Or a bad one? > >Thanks, >Alex.
