Hi,

My team is interested in trying out Samza to augment or replace our hand-rolled Kafka-based stream processing system. I have a question about sharing memory across task instances.
Currently, our main stream processing application has some large, immutable objects that need to be loaded into JVM heap memory in order to process messages on any partition of certain topics. We use thread-based parallelism in our system, so the Kafka consumer threads on each machine listening to these topics can share the same instance of these large heap objects. This is very convenient, as the objects are so large that storing multiple copies of them would be quite wasteful.

To use Samza, it seems as though each JVM would have to store its own copy of these objects, even if we were to use LevelDB's off-heap storage - each JVM would eventually have to inflate the off-heap memory into regular Java objects to make it usable. One solution could be to use something like Google's FlatBuffers [0] for these large objects, so that we could use accessors over large, off-heap ByteBuffers without ever actually deserializing them. However, we think that doing this for all of the relevant data we have would be a lot of work.

Have you guys considered implementing a thread-based parallelism model for Samza, whether as a replacement for or alongside the current JVM-based parallelism approach? What obstacles are there to making this happen, assuming you decided not to do it? This approach would be invaluable for our use case, since we rely so heavily (perhaps unfortunately so) on these shared heap data structures.

Thanks,
Jordan Lewis

[0]: http://google.github.io/flatbuffers/
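P.S. For concreteness, here is a minimal sketch of the FlatBuffers-style idea I mentioned: typed accessor methods over a single off-heap ByteBuffer, so every consumer thread can read fields in place without deserializing anything onto the heap. The record layout and the names (PackedRecords, id/timestamp/score) are invented purely for illustration, not part of any real schema of ours.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical fixed-width layout: [int id][long timestamp][double score],
// records packed back-to-back in one direct (off-heap) buffer.
public class PackedRecords {
    private static final int RECORD_SIZE = 4 + 8 + 8;
    private final ByteBuffer buf;

    public PackedRecords(ByteBuffer buf) {
        // duplicate() gives this reader an independent position/limit;
        // absolute gets below never mutate shared state, so the instance
        // is safe to share read-only across threads.
        this.buf = buf.duplicate().order(ByteOrder.LITTLE_ENDIAN);
    }

    // Absolute-index accessors: no objects allocated, no deserialization.
    public int id(int i)         { return buf.getInt(i * RECORD_SIZE); }
    public long timestamp(int i) { return buf.getLong(i * RECORD_SIZE + 4); }
    public double score(int i)   { return buf.getDouble(i * RECORD_SIZE + 12); }

    public static void main(String[] args) {
        ByteBuffer direct = ByteBuffer.allocateDirect(2 * RECORD_SIZE)
                                      .order(ByteOrder.LITTLE_ENDIAN);
        direct.putInt(42).putLong(1000L).putDouble(0.5);
        direct.putInt(43).putLong(2000L).putDouble(0.9);

        PackedRecords records = new PackedRecords(direct);
        System.out.println(records.id(1));    // prints 43
        System.out.println(records.score(0)); // prints 0.5
    }
}
```

Real FlatBuffers generates this kind of accessor class from a schema; the sketch just shows why the approach avoids per-JVM heap copies, and why doing it by hand for all our data types would be tedious.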
