Hi all, I'd appreciate any advice on how best to do this:
I have a very small dimension table that I want to join with a large-volume event stream. Partitioning the event stream just for this join seems overly complicated for this use case. What I really want is for all tasks of my stream-join job to consume all partitions of the dimension stream. I see that there's an open issue for this and that it isn't supported yet: https://issues.apache.org/jira/browse/SAMZA-353

As a workaround, I was thinking of creating another topic for the dimension stream with the same number of partitions as the event stream, where each partition contains a full copy of the dimension table. To do this, I think I need a job that copies (fans out) messages from the single upstream partition of the dimension stream to every partition of the new topic. The issue is that Samza's OutgoingMessageEnvelope API doesn't let me specify a partition id for an outbound message, only a partition key.

What's the best way to ensure that a given message goes to a particular partition id?

1) Take advantage of the Kafka default partitioner's use of the key object's hashCode, and make the key object hash to the partition id. Is this too brittle, given that it relies on the DefaultPartitioner implementation of the old Kafka producer API?
2) Create a custom Kafka partitioner and enable it with the "partitioner.class" setting?
3) Is there another way?

Rough sketches of what I have in mind are below.
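To make the fan-out job concrete, here's a minimal sketch. The system name, topic name, task class name, and hard-coded partition count are all made up; using the Integer partition id itself as the partition key is option 1, relying on Integer.hashCode() returning the value so that the default partitioner's abs(hashCode) % numPartitions lands on the intended partition.

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

public class DimensionFanOutTask implements StreamTask {
  // Made-up values; the partition count must match the event stream's.
  private static final int NUM_PARTITIONS = 16;
  private static final SystemStream OUTPUT =
      new SystemStream("kafka", "dimension-table-fanout");

  @Override
  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    // Send a copy of every dimension record to each partition of the
    // fan-out topic, using the target partition id as the partition key.
    for (int partitionId = 0; partitionId < NUM_PARTITIONS; partitionId++) {
      collector.send(new OutgoingMessageEnvelope(
          OUTPUT, partitionId, envelope.getKey(), envelope.getMessage()));
    }
  }
}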
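For option 2, I imagine the partitioner itself would look something like this. The package and class names are made up, and it assumes the partition key reaches the partitioner as the Integer I set on the envelope rather than as serialized bytes, which I haven't verified:

package com.example;

import kafka.producer.Partitioner;
import kafka.utils.VerifiableProperties;

public class PartitionIdPartitioner implements Partitioner {
  // The old producer API instantiates partitioners reflectively,
  // passing in the producer's properties.
  public PartitionIdPartitioner(VerifiableProperties props) {
  }

  @Override
  public int partition(Object key, int numPartitions) {
    // Interpret the partition key as the target partition id itself.
    // Assumption: key arrives as the Integer from the envelope.
    return ((Integer) key).intValue() % numPartitions;
  }
}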
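If I went that route, I'd expect to enable it through the job config, since producer.* properties get passed through to the Kafka producer; something like this for a system named "kafka" (again, unverified on my end):

systems.kafka.producer.partitioner.class=com.example.PartitionIdPartitioner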
Thanks,
Roger