Hi,

My team is interested in trying out Samza to augment or replace our hand-rolled Kafka-based stream processing system. I have a question about sharing memory across task instances.
Currently, our main stream processing application has some large, immutable objects that need to be loaded into JVM heap memory in order to process messages on any partition of certain topics. We use thread-based parallelism in our system, so the Kafka consumer threads on each machine listening to these topics can share the same instance of these large heap objects. This is very convenient, as the objects are so large that storing multiple copies of them would be quite wasteful.

To use Samza, it seems as though each JVM would have to store its own copy of these objects, even if we were to use LevelDB's off-heap storage - each JVM would eventually have to inflate the off-heap memory into regular Java objects to make it usable. One solution could be to use something like Google's FlatBuffers [0] for these large objects, so that we could use accessors over large, off-heap ByteBuffers without ever actually deserializing them. However, we think that doing this for all of the relevant data we have would be a lot of work.

Have you guys considered implementing a thread-based parallelism model for Samza, whether as a replacement for or alongside the current JVM-based parallelism approach? What obstacles are there to making this happen, assuming you decided not to do it? This approach would be invaluable for our use case, since we rely so heavily (perhaps unfortunately so) on these shared heap data structures.

Thanks,
Jordan Lewis

[0]: http://google.github.io/flatbuffers/
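P.S. For concreteness, here is a minimal sketch of the FlatBuffers-style idea I mentioned: typed accessor methods over a single off-heap ByteBuffer, so every consumer thread can read fields in place without deserializing anything onto the heap. The record layout and the names (PackedRecords, id/timestamp/score) are invented purely for illustration, not part of any real schema of ours.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical fixed-width layout: [int id][long timestamp][double score],
// records packed back-to-back in one direct (off-heap) buffer.
public class PackedRecords {
    private static final int RECORD_SIZE = 4 + 8 + 8;
    private final ByteBuffer buf;

    public PackedRecords(ByteBuffer buf) {
        // duplicate() gives this reader an independent position/limit;
        // absolute gets below never mutate shared state, so the instance
        // is safe to share read-only across threads.
        this.buf = buf.duplicate().order(ByteOrder.LITTLE_ENDIAN);
    }

    // Absolute-index accessors: no objects allocated, no deserialization.
    public int id(int i)         { return buf.getInt(i * RECORD_SIZE); }
    public long timestamp(int i) { return buf.getLong(i * RECORD_SIZE + 4); }
    public double score(int i)   { return buf.getDouble(i * RECORD_SIZE + 12); }

    public static void main(String[] args) {
        ByteBuffer direct = ByteBuffer.allocateDirect(2 * RECORD_SIZE)
                                      .order(ByteOrder.LITTLE_ENDIAN);
        direct.putInt(42).putLong(1000L).putDouble(0.5);
        direct.putInt(43).putLong(2000L).putDouble(0.9);

        PackedRecords records = new PackedRecords(direct);
        System.out.println(records.id(1));    // prints 43
        System.out.println(records.score(0)); // prints 0.5
    }
}
```

Real FlatBuffers generates this kind of accessor class from a schema; the sketch just shows why the approach avoids per-JVM heap copies, and why doing it by hand for all our data types would be tedious.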
