Re: Broker side round robin on topic partitions when receiving messages

2020-06-15 Thread Vinicius Scheidegger
I understand your point of view... My requirement is *exact* balancing -
parts of our current flow have a consumption processing time of around 5
minutes... (and this is an important/expensive part - because it's CPU and
memory intensive and we'd like to avoid queueing) so we need EQUAL load
balancing - and we need to know when we need to scale up or down.
If you pay attention I'm always saying equal load balancing with multiple
producers.
By that I mean: if I have 10 partitions in a topic and send 10 messages
from different producers I expect the load to be exactly divided, 1 message
in each partition.

I thought about some possible solutions using Kafka as-is, although there
is always a drawback.

What Kafka offers out of the box:
A) RoundRobinPartitioner - cyclic round robin internal to each producer -
tends to produce up to N messages in a single partition before every
partition has received one (where N is the number of producers). Drawback:
unequal balance over short periods of time (depending on the number of
producers, which producer the messages come from, etc).
B) DefaultPartitioner - hash of the key modulo the total number of
partitions - if a random key is used, the messages should mathematically
(think of a large number of messages) be equally distributed - Drawback:
unequal balance over short periods of time.
Is that correct / do you agree?
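
To put a number on the "unequal balance over short periods" drawback of (B),
here is a minimal sketch. It assumes that hashing a random key behaves like
picking a partition uniformly at random; it does not call Kafka or murmur2,
and the function names are made up for illustration.

```python
import math
import random
from collections import Counter


def prob_perfect_spread(partitions: int) -> float:
    """Chance that `partitions` uniformly random picks hit every partition exactly once."""
    return math.factorial(partitions) / partitions ** partitions


def expected_empty(partitions: int, messages: int) -> float:
    """Expected number of partitions that receive no message at all."""
    return partitions * (1 - 1 / partitions) ** messages


def simulate(messages: int, partitions: int, seed: int = 0) -> list:
    """One random run: number of messages that landed in each partition."""
    rng = random.Random(seed)
    counts = Counter(rng.randrange(partitions) for _ in range(messages))
    return [counts.get(p, 0) for p in range(partitions)]


if __name__ == "__main__":
    print(prob_perfect_spread(10))   # about 0.00036, i.e. roughly 0.036%
    print(expected_empty(10, 10))    # about 3.49 partitions stay empty on average
    print(simulate(10, 10))          # one sample distribution of 10 messages
```

Under that assumption, 10 messages over 10 partitions end up exactly one per
partition in only about 0.036% of the runs, which matches the queueing we see
on slow consumers.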

Possible options we could think of:

1) A custom partitioner using shared memory between producers to decide the
next partition; Drawback - all producers would need to be within the shared
memory boundary.
2) Creating a single dummy consumer/producer with RoundRobinPartitioner
between two topics: "in", where the real producers would send the messages,
and "out", with multiple partitions, where the real consumers would listen.
Drawbacks: Single point of failure (ok, we could have a single-partition
"in" topic and an extra consumer within the same consumer group that could
take over if the consumer/producer fails) - I believe we could go far here,
but it makes design, maintainability, monitoring, etc worse from an
architecture point of view.
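
A minimal sketch of the relay in option 2, with plain Python lists standing
in for the "in" and "out" topics (no Kafka client is involved and the
function name `relay` is made up for illustration):

```python
from itertools import cycle


def relay(in_messages, out_partitions: int):
    """Single relay: read the "in" topic in order and hand each message
    to the next "out" partition in strict rotation."""
    out = [[] for _ in range(out_partitions)]
    next_partition = cycle(range(out_partitions))
    for msg in in_messages:
        out[next(next_partition)].append(msg)
    return out


if __name__ == "__main__":
    # 10 messages from 10 different producers, all funnelled through "in".
    incoming = [f"producer{i}-msg0" for i in range(10)]
    for partition, msgs in enumerate(relay(incoming, 10)):
        print(partition, msgs)
```

Because only one process decides the partition, it does not matter which
producer a message came from; the cost is the extra hop and the single point
of failure described above.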

I didn't understand the atomic counter - I guess it would look like
number 1?
And maybe fanout would look like number 2?
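
My guess at what the atomic counter idea could look like, as a minimal
sketch: threads stand in for producers and a lock-protected integer stands in
for the counter. The class and function names are hypothetical; for producers
on different machines the counter would have to live in some external store,
which is not shown here.

```python
import threading
from collections import Counter


class SharedCounterPartitioner:
    """Every producer asks the same counter for its next partition."""

    def __init__(self, num_partitions: int):
        self._n = num_partitions
        self._next = 0
        self._lock = threading.Lock()

    def partition(self) -> int:
        # Read and increment under one lock so no two callers see the same value.
        with self._lock:
            p = self._next % self._n
            self._next += 1
            return p


def run_producers(num_producers: int, messages_each: int, num_partitions: int):
    """Run the producers concurrently and return messages per partition."""
    partitioner = SharedCounterPartitioner(num_partitions)
    counts = Counter()
    counts_lock = threading.Lock()

    def produce():
        for _ in range(messages_each):
            p = partitioner.partition()
            with counts_lock:
                counts[p] += 1

    threads = [threading.Thread(target=produce) for _ in range(num_producers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [counts[p] for p in range(num_partitions)]


if __name__ == "__main__":
    print(run_producers(10, 1, 10))  # one message per partition
```

With a shared counter the split is exact no matter how the producers
interleave, but every send now depends on the counter being reachable.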

I believe we should be able to do perfect load balancing, 10 messages
received in a topic being distributed to 10 partitions - 20 messages, 20
partitions, no matter who generated them.
The thing is that currently the broker receives messages at the partition
level only. There is no way to send them at the topic level and have the
broker redistribute them.

We are currently paying for extra idle machines - my ideas are:
i) make sure we are not missing something (maybe some of our assumptions
are wrong and we have easy ootb options)
ii) if we are not missing something, going with option 1 (and limiting our
producers to be within the shared mem boundaries)
iii) checking the feasibility (how hard would it be?) and the community's
acceptance of doing this in Kafka by submitting a KIP

Thanks once again!



On Mon, Jun 15, 2020 at 9:09 PM Colin McCabe  wrote:

> This is a bit frustrating since you keep saying that the load is not
> balanced, but the load actually is balanced, it's just balanced in an
> approximate fashion.  If you need exact balancing (for example, because
> you're creating a job scheduler or something), then you need to use a
> different strategy.  One example would be using an external atomic counter
> to determine what partition the producers should send the messages to.
> Another would be using a single consumer with fanout.  I think this is
> outside the scope of Kafka, at least if I understand the problem here (?)
>
> best,
> Colin
>
> On Mon, Jun 15, 2020, at 11:32, Vinicius Scheidegger wrote:
> > Hi Colin,
> >
> > One producer shouldn't need to know about the other to distribute the load
> > equally, but what Kafka has now is roughly equal...
> > If you have a single producer, RoundRobinPartitioner works fine; if you have
> > 10 producers you can have 7 or 8 messages in one partition while another
> > partition has none (when the producers are in sync - which happened a couple
> > of times in our tests).
> >
> > Producer0 getNext() = partition0
> > Producer1 getNext() = partition0
> > Producer2 getNext() = partition0
> >
> > A link to some of our test data prints:
> > https://imgur.com/a/ha9OQMj
> >
> > This, depending on how intensive (slow) your consumption rate is, may be a
> > problem as it will cause messages to queue up.
> > We use Kafka as a messaging protocol in a big (and at some points heavy
> > load) machine learning flow - for high throughput (lightweight processing)
> > queueing is not an issue, although we saw it happening, but for heavy
> > processes we are unable to do equal load balancing.
> >
> > We currently use t

Re: Broker side round robin on topic partitions when receiving messages

2020-06-15 Thread Vinicius Scheidegger
Hi Colin,

One producer shouldn't need to know about the other to distribute the load
equally, but what Kafka has now is roughly equal...
If you have a single producer, RoundRobinPartitioner works fine; if you have
10 producers you can have 7 or 8 messages in one partition while another
partition has none (when the producers are in sync - which happened a couple
of times in our tests).

Producer0 getNext() = partition0
Producer1 getNext() = partition0
Producer2 getNext() = partition0
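
The same effect as a minimal sketch: each producer keeps its own private
counter, and we assume all counters happen to sit at the same position (the
"in sync" case). `LocalRoundRobin` is a made-up stand-in, not Kafka's actual
RoundRobinPartitioner class.

```python
from collections import Counter


class LocalRoundRobin:
    """Producer-local round robin: knows nothing about other producers."""

    def __init__(self, num_partitions: int, start: int = 0):
        self._n = num_partitions
        self._next = start

    def get_next(self) -> int:
        p = self._next % self._n
        self._next += 1
        return p


def spread(num_producers: int, num_partitions: int, starts=None):
    """Each producer sends one message; return messages per partition."""
    if starts is None:
        starts = [0] * num_producers  # worst case: every counter is in sync
    producers = [LocalRoundRobin(num_partitions, s) for s in starts]
    counts = Counter(p.get_next() for p in producers)
    return [counts.get(i, 0) for i in range(num_partitions)]


if __name__ == "__main__":
    print(spread(3, 10))                          # [3, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    print(spread(10, 10, starts=list(range(10)))) # staggered counters: one each
```

Each producer is perfectly fair on its own, yet the partition-level result
depends entirely on how the independent counters happen to line up.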

A link to some of our test data prints:
https://imgur.com/a/ha9OQMj

This, depending on how intensive (slow) your consumption rate is, may be a
problem as it will cause messages to queue up.
We use Kafka as a messaging protocol in a big (and at some points heavy
load) machine learning flow - for high throughput (lightweight processing)
queueing is not an issue, although we saw it happening, but for heavy
processes we are unable to do equal load balancing.

We currently use the DefaultPartitioner and Kafka's algorithm (murmur2 hash
of the key) to decide the partition.
We noticed queueing and timeouts while several consumers were idle - which
made us take a closer look at how the load is balanced.

I believe the only way to perform equal load balance without having to know
other producers would be to do it on the Broker side. Do you agree?

Thanks,



On Mon, Jun 15, 2020 at 7:32 PM Colin McCabe  wrote:

> Hi Vinicius,
>
> It's actually not necessary for one producer to know about the others to
> get an even distribution across partitions, right?  All that's really
> required is that all producers produce a roughly equal amount of data to
> each partition, which is what RoundRobinPartitioner is designed to do.  In
> mathematical terms, a mixture of several uniform distributions is itself
> uniform.
>
> (There is a bug in RRP right now, KAFKA-9965, but it's not related to what
> we're talking about now and we have a fix ready.)
>
> cheers,
> Colin
>
>
> On Sun, Jun 14, 2020, at 14:26, Vinicius Scheidegger wrote:
> > Hi Colin,
> >
> > Thanks for the reply. Actually the RoundRobinPartitioner won't do an equal
> > distribution when working with multiple producers. One producer does not
> > know about the others. If you consider that producers are randomly producing
> > messages, in the worst-case scenario all producers can be in sync and one
> > could have as many messages in a single partition as the number of
> > producers.
> > It's easy to generate evidence of it.
> >
> > I have asked this question on the users mailing list too (and on Slack and on
> > Stack Overflow).
> >
> > Kafka currently has no means to do a round robin across multiple
> > producers or on the broker side.
> >
> > This means there is currently NO GUARANTEE of equal distribution across
> > partitions, as the partition selection is decided by the producer.
> >
> > The result is unbalanced consumption when working with consumer groups,
> > and the options are: creating a custom shared partitioner, relying on Kafka's
> > random partitioning or introducing a middleman between topics (all of them
> > having big cons).
> >
> > I thought of asking here to see whether this is a topic that could concern
> > other developers (and maybe understand whether this could be a KIP
> > discussion).
> >
> > Maybe I'm missing something... I would like to know.
> >
> > According to my interpretation of the code (I just read through some
> > classes), there is currently no way to do partition balancing on the
> > broker - the producer sends messages directly to partition leaders, so the
> > partition currently needs to be chosen on the producer.
> >
> > I understand that in order to perform round robin across partitions of a
> > topic when working with multiple producers, some development needs to be
> > done. Am I right?
> >
> >
> > Thanks
> >
> >
> > On Fri, Jun 12, 2020, 10:57 PM Colin McCabe  wrote:
> >
> > > Hi Vinicius,
> > >
> > > This question seems like a better fit for the user mailing list rather
> > > than the developer mailing list.
> > >
> > > Anyway, if I understand correctly, you are asking if the producer can
> > > choose to assign partitions in a round-robin fashion rather than based on
> > > the key.  The answer is, you can, by using RoundRobinPartitioner. (again,
> > > if I'm understanding the question correctly).
> > >
> > > best,
> > > Colin
> > >
> > > On Tue, Jun 9, 2020, at 00:48, Vinicius Scheidegger wrote:
> > > > Anyone?
> > > >
> > > > On Fri, Jun 5, 2020 at 2:42 P

Re: Broker side round robin on topic partitions when receiving messages

2020-06-14 Thread Vinicius Scheidegger
Hi Colin,

Thanks for the reply. Actually the RoundRobinPartitioner won't do an equal
distribution when working with multiple producers. One producer does not
know about the others. If you consider that producers are randomly producing
messages, in the worst-case scenario all producers can be in sync and one
could have as many messages in a single partition as the number of
producers.
It's easy to generate evidence of it.

I have asked this question on the users mailing list too (and on Slack and on
Stack Overflow).

Kafka currently has no means to do a round robin across multiple
producers or on the broker side.

This means there is currently NO GUARANTEE of equal distribution across
partitions, as the partition selection is decided by the producer.

The result is unbalanced consumption when working with consumer groups,
and the options are: creating a custom shared partitioner, relying on Kafka's
random partitioning or introducing a middleman between topics (all of them
having big cons).

I thought of asking here to see whether this is a topic that could concern
other developers (and maybe understand whether this could be a KIP
discussion).

Maybe I'm missing something... I would like to know.

According to my interpretation of the code (I just read through some
classes), there is currently no way to do partition balancing on the
broker - the producer sends messages directly to partition leaders, so the
partition currently needs to be chosen on the producer.

I understand that in order to perform round robin across partitions of a
topic when working with multiple producers, some development needs to be
done. Am I right?


Thanks


On Fri, Jun 12, 2020, 10:57 PM Colin McCabe  wrote:

> Hi Vinicius,
>
> This question seems like a better fit for the user mailing list rather
> than the developer mailing list.
>
> Anyway, if I understand correctly, you are asking if the producer can
> choose to assign partitions in a round-robin fashion rather than based on
> the key.  The answer is, you can, by using RoundRobinPartitioner. (again,
> if I'm understanding the question correctly).
>
> best,
> Colin
>
> On Tue, Jun 9, 2020, at 00:48, Vinicius Scheidegger wrote:
> > Anyone?
> >
> > On Fri, Jun 5, 2020 at 2:42 PM Vinicius Scheidegger <
> > vinicius.scheideg...@gmail.com> wrote:
> >
> > > Does anyone know how I could balance the load to distribute the messages
> > > equally to all consumers within the same consumer group when there are
> > > multiple producers?
> > >
> > > Is this a conceptual flaw in Kafka? Was it not designed for equal
> > > distribution with multiple producers, or am I missing something?
> > > I've asked on Stack Overflow, on the Kafka users mailing list, here (on
> > > Kafka Devs) and on Slack - and still have no definitive answer (actually
> > > most of the time I got no answer at all).
> > >
> > > Would something like this even be possible in the way Kafka is currently
> > > designed?
> > > How does proposing a KIP work?
> > >
> > > Thanks,
> > >
> > >
> > >
> > > On Thu, May 28, 2020, 3:44 PM Vinicius Scheidegger <
> > > vinicius.scheideg...@gmail.com> wrote:
> > >
> > >> Hi,
> > >>
> > >> I'm trying to understand a little bit more about how Kafka works.
> > >> I have a design with multiple producers writing to a single topic and
> > >> multiple consumers in a single Consumer Group consuming messages from this
> > >> topic.
> > >>
> > >> My idea is to distribute the messages from all producers equally. From
> > >> reading the documentation I understood that the partition is always
> > >> selected by the producer. Is that correct?
> > >>
> > >> I'd also like to know if there is an out-of-the-box option to assign the
> > >> partition via round robin *on the broker side* to guarantee equal
> > >> distribution of the load - if possible to each consumer, but if not
> > >> possible, at least to each partition.
> > >>
> > >> If my understanding is correct, it looks like in a multiple-producer
> > >> scenario there is a lack of support from Kafka regarding load balancing and
> > >> customers have to either stick to the hash of the key (random distribution,
> > >> although it would guarantee the same key goes to the same partition) or
> > >> create their own logic on the producer side (e.g. by sharing memory).
> > >>
> > >> Am I missing something?
> > >>
> > >> Thank you,
> > >>
> > >> Vinicius Scheidegger
> > >>
> > >
> >
>


Re: Broker side round robin on topic partitions when receiving messages

2020-06-09 Thread Vinicius Scheidegger
Anyone?

On Fri, Jun 5, 2020 at 2:42 PM Vinicius Scheidegger <
vinicius.scheideg...@gmail.com> wrote:

> Does anyone know how I could balance the load to distribute the messages
> equally to all consumers within the same consumer group when there are
> multiple producers?
>
> Is this a conceptual flaw in Kafka? Was it not designed for equal
> distribution with multiple producers, or am I missing something?
> I've asked on Stack Overflow, on the Kafka users mailing list, here (on Kafka
> Devs) and on Slack - and still have no definitive answer (actually most of
> the time I got no answer at all).
>
> Would something like this even be possible in the way Kafka is currently
> designed?
> How does proposing a KIP work?
>
> Thanks,
>
>
>
> On Thu, May 28, 2020, 3:44 PM Vinicius Scheidegger <
> vinicius.scheideg...@gmail.com> wrote:
>
>> Hi,
>>
>> I'm trying to understand a little bit more about how Kafka works.
>> I have a design with multiple producers writing to a single topic and
>> multiple consumers in a single Consumer Group consuming messages from this
>> topic.
>>
>> My idea is to distribute the messages from all producers equally. From
>> reading the documentation I understood that the partition is always
>> selected by the producer. Is that correct?
>>
>> I'd also like to know if there is an out-of-the-box option to assign the
>> partition via round robin *on the broker side* to guarantee equal
>> distribution of the load - if possible to each consumer, but if not
>> possible, at least to each partition.
>>
>> If my understanding is correct, it looks like in a multiple-producer
>> scenario there is a lack of support from Kafka regarding load balancing and
>> customers have to either stick to the hash of the key (random distribution,
>> although it would guarantee the same key goes to the same partition) or
>> create their own logic on the producer side (e.g. by sharing memory).
>>
>> Am I missing something?
>>
>> Thank you,
>>
>> Vinicius Scheidegger
>>
>


Re: Broker side round robin on topic partitions when receiving messages

2020-06-05 Thread Vinicius Scheidegger
Does anyone know how I could balance the load to distribute the messages
equally to all consumers within the same consumer group when there are
multiple producers?

Is this a conceptual flaw in Kafka? Was it not designed for equal
distribution with multiple producers, or am I missing something?
I've asked on Stack Overflow, on the Kafka users mailing list, here (on Kafka
Devs) and on Slack - and still have no definitive answer (actually most of
the time I got no answer at all).

Would something like this even be possible in the way Kafka is currently
designed?
How does proposing a KIP work?

Thanks,



On Thu, May 28, 2020, 3:44 PM Vinicius Scheidegger <
vinicius.scheideg...@gmail.com> wrote:

> Hi,
>
> I'm trying to understand a little bit more about how Kafka works.
> I have a design with multiple producers writing to a single topic and
> multiple consumers in a single Consumer Group consuming messages from this
> topic.
>
> My idea is to distribute the messages from all producers equally. From
> reading the documentation I understood that the partition is always
> selected by the producer. Is that correct?
>
> I'd also like to know if there is an out-of-the-box option to assign the
> partition via round robin *on the broker side* to guarantee equal
> distribution of the load - if possible to each consumer, but if not
> possible, at least to each partition.
>
> If my understanding is correct, it looks like in a multiple-producer
> scenario there is a lack of support from Kafka regarding load balancing and
> customers have to either stick to the hash of the key (random distribution,
> although it would guarantee the same key goes to the same partition) or
> create their own logic on the producer side (e.g. by sharing memory).
>
> Am I missing something?
>
> Thank you,
>
> Vinicius Scheidegger
>


Broker side round robin on topic partitions when receiving messages

2020-05-28 Thread Vinicius Scheidegger
Hi,

I'm trying to understand a little bit more about how Kafka works.
I have a design with multiple producers writing to a single topic and
multiple consumers in a single Consumer Group consuming messages from this
topic.

My idea is to distribute the messages from all producers equally. From
reading the documentation I understood that the partition is always
selected by the producer. Is that correct?

I'd also like to know if there is an out-of-the-box option to assign the
partition via round robin *on the broker side* to guarantee equal
distribution of the load - if possible to each consumer, but if not
possible, at least to each partition.

If my understanding is correct, it looks like in a multiple-producer
scenario there is a lack of support from Kafka regarding load balancing and
customers have to either stick to the hash of the key (random distribution,
although it would guarantee the same key goes to the same partition) or
create their own logic on the producer side (e.g. by sharing memory).

Am I missing something?

Thank you,

Vinicius Scheidegger