If you are used to map-reduce patterns, this sounds like a perfectly
natural way to process streams of data.

Call the first consumer "map-combine-log", the topic "shuffle-log" and
the second consumer "reduce-log" :)
I like that a lot. It works well for either "embarrassingly parallel"
cases or "so much data that more parallelism is worth the extra
overhead" cases.

I personally don't care if it's in core Kafka, KIP-28, or a GitHub
project elsewhere, but I find it useful and non-esoteric.
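
To make that pipeline a bit more concrete, here is a minimal sketch of the
second stage ("reduce-log"): it consumes the per-partition partial counts
that the first stage ("map-combine-log") wrote to the intermediate topic
("shuffle-log" here, "log-counts" in Jason's example below) and folds them
into running totals per host. The broker address, serializers, and the use
of the new consumer API are illustrative assumptions on my part, not
anything from KAFKA-2092.

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ReduceLog {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumption: local broker
        props.put("group.id", "reduce-log");
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        // running totals per host, folded from the partial counts of each upstream partition
        Map<String, Long> totals = new HashMap<>();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("shuffle-log"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(1000);
                for (ConsumerRecord<String, String> record : records) {
                    // each record is a partial count for one host from one upstream partition
                    totals.merge(record.key(), Long.parseLong(record.value()), Long::sum);
                }
            }
        }
    }
}

Because the partitioner only guarantees that each key lands in one of two
partitions, this second stage is what restores the per-key totals.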



On Mon, Jul 27, 2015 at 12:51 PM, Jason Gustafson <ja...@confluent.io> wrote:
> For a little background, the difference between this partitioner and the
> default one is that it breaks the deterministic mapping from key to
> partition. Instead, messages for a given key can end up in either of two
> partitions. This means that the consumer generally won't see all messages
> for a given key. Instead the consumer would compute an aggregate for each
> key on the partitions it consumes and write them to a separate topic. For
> example, if you are writing log messages to a "logs" topic with the
> hostname as the key, you could use this partitioning strategy to compute
> message counts for each host in each partition and write them to a
> "log-counts" topic. Then a consumer of the "log-counts" topic would compute
> total aggregates based on the two intermediate aggregates. The benefit is
> that you are generally going to get better load balancing across partitions
> than if you used the default partitioner. (Please correct me if my
> understanding is incorrect, Gianmarco)
>
> So I think the question is whether this is a useful primitive for Kafka to
> provide out of the box? I was a little concerned that this use case is
> somewhat esoteric for a core feature, but it may make more sense in the
> context of KIP-28 which would provide some higher-level processing
> capabilities (though it doesn't seem like the KStream abstraction would
> provide a direct way to leverage this partitioner without custom logic).
>
> Thanks,
> Jason
>
>
> On Wed, Jul 22, 2015 at 12:14 AM, Gianmarco De Francisci Morales <
> g...@apache.org> wrote:
>
>> Hello folks,
>>
>> I'd like to ask the community about its opinion on the partitioning
>> functions in Kafka.
>>
>> With KAFKA-2091 <https://issues.apache.org/jira/browse/KAFKA-2091>
>> integrated we are now able to have custom partitioners in the producer.
>> The question now becomes *which* partitioners should ship with Kafka?
>> This issue arose in the context of KAFKA-2092
>> <https://issues.apache.org/jira/browse/KAFKA-2092>, which implements a
>> specific load-balanced partitioning. This partitioner however assumes some
>> stages of processing on top of it to make proper use of the data, i.e., it
>> envisions Kafka as a substrate for stream processing, and not only as the
>> I/O component.
>> Is this a direction that Kafka wants to go towards? Or is this a role
>> better left to the internal communication systems of other stream
>> processing engines (e.g., Storm)?
>> And if the answer is the latter, how would something such as Samza (which
>> relies mostly on Kafka as its communication substrate) be able to implement
>> advanced partitioning schemes?
>>
>> Cheers,
>> --
>> Gianmarco
>>
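
And for anyone who hasn't opened the tickets: KAFKA-2091 made the producer's
partitioner pluggable (configured via partitioner.class), and KAFKA-2092
builds a load-balanced partitioner on top of that. The sketch below is not
the KAFKA-2092 patch, only a rough illustration of the "two choices per key"
idea against the pluggable Partitioner interface: each key hashes to two
candidate partitions, and the producer sends to whichever of the two it has
sent fewer records to so far. The hash functions and the purely local,
per-producer load counters are simplifying assumptions.

import java.util.Arrays;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLongArray;

import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

public class TwoChoicesPartitioner implements Partitioner {

    // how many records this producer has sent to each partition, per topic
    // (a local approximation of partition load, good enough for a sketch)
    private final Map<String, AtomicLongArray> sent = new ConcurrentHashMap<>();

    @Override
    public void configure(Map<String, ?> configs) {}

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if (keyBytes == null) {
            // unkeyed records: just pick a random partition
            return ThreadLocalRandom.current().nextInt(numPartitions);
        }
        AtomicLongArray load =
                sent.computeIfAbsent(topic, t -> new AtomicLongArray(numPartitions));

        // two candidate partitions from two different hashes of the key
        int first = (Arrays.hashCode(keyBytes) & 0x7fffffff) % numPartitions;
        byte[] salted = Arrays.copyOf(keyBytes, keyBytes.length + 1);
        salted[keyBytes.length] = 0x55;
        int second = (Arrays.hashCode(salted) & 0x7fffffff) % numPartitions;

        // "power of two choices": send to the candidate we've sent less to so far
        int chosen = load.get(first) <= load.get(second) ? first : second;
        load.incrementAndGet(chosen);
        return chosen;
    }

    @Override
    public void close() {}
}

The consequence, as Jason describes above, is that a consumer can no longer
assume it sees every record for a given key, which is exactly why the
two-stage (partial then total) aggregation is needed downstream.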
