Alongside recent talk of integrating Infinispan with Hadoop batch processing, there has been some discussion of using the data grid alongside an event stream processing system.
There are several directions we could consider here. In approximate order of increasing complexity these are:

- Allow bi-directional flow of events, such that listeners on the cache can be used to cause events in the processing engine, or events in the processing engine can update the cache.

- Allow the cache to be used to hold lookup data for reference from user code running the processing engine, to speed up joining streamed events to what would otherwise be data tables on disk.

- Integrate with the processing engine itself, such that Infinispan can be used to store items that would otherwise occupy precious RAM. This one is probably only viable with the cooperation of the stream processing system, so I'll base further discussion on Drools Fusion. The engine uses memory for

  a) rules, i.e. processing logic. Some of this is infrequently accessed. Think of a decision tree in which some branches are traversed more than others. So, opportunities to swap bits out to cache perhaps.

  b) state, particularly sliding windows. Again some data is infrequently accessed. For many sliding window calculations in particular (e.g. running average), only the head and tail of the window are actually used. The events in-between can be swapped out.

  Of course these integrations require the stream processing engine to be written to support such operations - careful handling of object references is needed. Currently the engine doesn't work that way - everything is focussed on speed at the expense of memory.

- Borrow some ideas from the event processing DSLs, such that the data grid query engine can independently support continuous (standing) queries rather than just one-off queries. Arguably this is reinventing the wheel, but for simple use cases it may be preferable to run the stream processing logic directly in the grid rather than deploying a dedicated event stream processing system.
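
To make the sliding window point in b) concrete, here is a minimal sketch of a running average over a count-based window. It is plain Java, not Drools Fusion or Infinispan code, and the class name SlidingAverage and its methods are made up for illustration. Each update reads only the newest event and the oldest one being evicted, which is why the events in between are candidates for swapping out.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical illustration only: a count-based sliding window that keeps
 * its average up to date incrementally instead of rescanning the window.
 */
final class SlidingAverage {
    private final int capacity;
    private final Deque<Double> window = new ArrayDeque<>();
    private double sum = 0.0;

    SlidingAverage(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    /** Adds the newest event; evicts the oldest once the window is full. */
    void add(double value) {
        window.addLast(value);          // touch one end: the newest event
        sum += value;
        if (window.size() > capacity) {
            sum -= window.removeFirst(); // touch the other end: the oldest event
        }
    }

    /** Average of the events currently in the window. */
    double average() {
        if (window.isEmpty()) {
            throw new IllegalStateException("window is empty");
        }
        return sum / window.size();
    }

    int size() {
        return window.size();
    }

    public static void main(String[] args) {
        SlidingAverage avg = new SlidingAverage(3);
        for (double v : new double[] {1, 2, 3, 4}) {
            avg.add(v);
        }
        // The window now holds 2, 3, 4.
        System.out.println(avg.average()); // prints 3.0
    }
}
```

In this sketch the deque still holds every event in RAM. A grid-backed variant would keep only the two ends resident and page the middle out, and that is the part that needs the engine's cooperation over object references.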

I think continuous queries are probably going to require supporting lists as a first class construct alongside maps though. There are various kludges possible here, including the brute force approach of faking continuous query by re-executing a one-off query on each mutation, but they tend to be inefficient. There is also the thorny problem of supporting a (potentially distributed) clock, since a lot of use cases need to reference the passage of time in the query, e.g. 'send event to listener if avg in last N minutes > x'.

Jonathan Halliday
Core developer, JBoss.
