Gianmarco,

I browsed through your paper (congrats on the ICDE publication, BTW!), and
here are some questions / comments on the algorithm:

1. One motivation for enabling key-based partitioning in Kafka is to achieve
per-key ordering, i.e. when all messages with the same key are sent to the
same partition, their ordering is preserved. However, "key-splitting" seems
to break this guarantee, since messages with the same key may now be sent
to 2 (or, generally speaking, many) partitions.

2. As for the local load estimation, besides the mapping from keys to
partitions there is a second mapping from partitions (workers in your
paper) to broker hosts, and not every broker host maintains every
partition. For example, with 4 brokers, broker-1/2 may each take one of
the two partitions of topic A, while broker-3/4 each take one of the
two partitions of topic B, etc.
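
To make the first concern concrete, here is a minimal sketch of how I
understand the two-choice routing: each key hashes to two candidate
partitions, and the producer picks whichever it has locally sent fewer
messages to. The class name PkgSketch, the seed values, and the hash mixing
are placeholders of mine for illustration, not code from the paper, Storm,
or Kafka.

```java
import java.util.HashSet;
import java.util.Set;

public class PkgSketch {
    // Messages this producer has sent to each partition (local estimate only).
    private final long[] localLoad;

    public PkgSketch(int numPartitions) {
        if (numPartitions < 1) {
            throw new IllegalArgumentException("need at least one partition");
        }
        this.localLoad = new long[numPartitions];
    }

    // Hypothetical seeded hash: mixes String.hashCode() with a seed so that
    // two different seeds give two (usually different) candidate partitions.
    private int candidate(String key, int seed) {
        int h = key.hashCode() * 31 + seed;
        h ^= (h >>> 16);
        h *= 0x85ebca6b;
        h ^= (h >>> 13);
        return Math.floorMod(h, localLoad.length);
    }

    // Route the key to the less loaded of its two candidate partitions.
    public int partition(String key) {
        int first = candidate(key, 17);
        int second = candidate(key, 101);
        int chosen = localLoad[first] <= localLoad[second] ? first : second;
        localLoad[chosen]++;
        return chosen;
    }

    public long loadOf(int partition) {
        return localLoad[partition];
    }

    public static void main(String[] args) {
        PkgSketch pkg = new PkgSketch(4);
        Set<Integer> seen = new HashSet<>();
        for (int i = 0; i < 1000; i++) {
            seen.add(pkg.partition("hot-key"));
        }
        // A single key ends up on at most two partitions.
        System.out.println("partitions used by hot-key: " + seen);
    }
}
```

With a single hot key, the printed set contains at most two partition ids.
Whenever it contains two, consecutive messages for that key can land on
different partitions, which is where per-key ordering is lost.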

I am wondering whether those two issues can be resolved within the PKG
framework.

Guozhang

On Sun, Apr 5, 2015 at 12:19 AM, Gianmarco De Francisci Morales <
g...@apache.org> wrote:

> Hi Jay,
>
> Thanks, that sounds like a necessary step. I guess I expected something
> like that to be already there, at least internally.
> I created KAFKA-2092 to track the PKG integration.
>
> Cheers,
>
> --
> Gianmarco
>
> On 4 April 2015 at 23:50, Jay Kreps <jay.kr...@gmail.com> wrote:
>
> > Hey guys,
> >
> > I think the first step here would be to expose a partitioner interface
> > for the new producer that would make it easy to plug in these different
> > strategies. I filed a JIRA for this:
> > https://issues.apache.org/jira/browse/KAFKA-2091
> >
> > -Jay
> >
> > On Fri, Apr 3, 2015 at 9:36 AM, Harsha <ka...@harsha.io> wrote:
> >
> >> Gianmarco,
> >> I am coming from the Storm community. I think PKG is very
> >> interesting, and we can provide an implementation of Partitioner for PKG.
> >> Can you open a JIRA for this?
> >>
> >> --
> >> Harsha
> >> Sent with Airmail
> >>
> >> On April 3, 2015 at 4:49:15 AM, Gianmarco De Francisci Morales (
> >> g...@apache.org) wrote:
> >>
> >> Hi,
> >>
> >> We have recently studied the problem of load balancing in distributed
> >> stream processing systems such as Samza [1].
> >> In particular, we focused on what happens when the key distribution of
> >> the stream is skewed when using key grouping.
> >> We developed a new stream partitioning scheme (which we call Partial Key
> >> Grouping). It achieves better load balancing than hashing while being
> >> more scalable than round robin in terms of memory.
> >>
> >> In the paper we show a number of mining algorithms that are easy to
> >> implement with partial key grouping, and whose performance can benefit
> >> from it. We think that it might also be useful for a larger class of
> >> algorithms.
> >>
> >> PKG has already been integrated in Storm [2], and I would like to be
> >> able to use it in Samza as well. As far as I understand, Kafka producers
> >> are the ones that decide how to partition the stream (or Kafka topic).
> >> Even after doing a bit of reading, I am still not sure if I should be
> >> writing this email here or on the Samza dev list. Anyway, my first guess
> >> is Kafka.
> >>
> >> I do not have experience with Kafka, however partial key grouping is
> >> very easy to implement: it requires just a few lines of code in Java when
> >> implemented as a custom grouping in Storm [3].
> >> I believe it should be very easy to integrate.
> >>
> >> For all these reasons, I believe it will be a nice addition to
> >> Kafka/Samza.
> >> If the community thinks it's a good idea, I will be happy to offer
> >> support in the porting.
> >>
> >> References:
> >> [1] https://melmeric.files.wordpress.com/2014/11/the-power-of-both-choices-practical-load-balancing-for-distributed-stream-processing-engines.pdf
> >> [2] https://issues.apache.org/jira/browse/STORM-632
> >> [3] https://github.com/gdfm/partial-key-grouping
> >> --
> >> Gianmarco
> >>
> >
> >
>



-- 
-- Guozhang
