mathieu-amblard commented on PR #12705:
URL: https://github.com/apache/kafka/pull/12705#issuecomment-1263687207

   @dajac Thanks for your reply and sorry if I was unclear in my description.
   
   I forgot to mention something very important in my description: I would like 
to be as thread-safe as possible.
   I would also like to have multiple consumers (1 thread per consumer) for 
performance reasons.
   For business purposes, each topic contains only one type of object.
   Most topics must not diverge too much, in order to keep good data 
consistency (i.e. we should not continue to consume one partition of a topic 
if one of its partitions failed).
   Combining these hypotheses and requirements, I thought that having as many 
consumers as topics is the best option, provided all partitions of a topic are 
assigned to only one consumer.
   
   -------------------------------------
   
   **First requirement: be as thread-safe as possible**
   
   If I use one of the four existing assignors, each consumer always gets a 
mix of partitions (i.e. partitions of different topics), so records of the 
same topic can be processed concurrently.
   
   Let me take an example.
   Suppose there are 2 consumers `C0` and `C1`, and two topics `t0` and `t1`, 
each with 3 partitions, resulting in partitions `t0p0`, `t0p1`, `t0p2`, 
`t1p0`, `t1p1` and `t1p2`.
   
   If I use the standard `RoundRobinAssignor`, the assignment will be:
   `C0: [t0p0, t0p2, t1p1]`
   `C1: [t0p1, t1p0, t1p2]`
   
   In that case, I must ensure that the processing executed when I receive 
records from `t0` or `t1` is thread-safe, because records of the same topic 
can come from two different threads, `C0` and `C1`, in parallel.
   
   If I use the `TopicRoundRobinAssignor`, the assignment will be:
   `C0: [t0p0, t0p1, t0p2]`
   `C1: [t1p0, t1p1, t1p2]`
   
   In that case, every record of a given topic is processed in the same 
thread, so there is less risk of concurrency issues.
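   
   To make the two assignments above concrete, here is a minimal Python sketch 
of both strategies. It is only a simplified model, not the actual Kafka code 
nor the implementation in this PR: it assumes every consumer subscribes to 
every topic, and the function names `round_robin_assign` and 
`topic_round_robin_assign` are hypothetical.

```python
from itertools import cycle


def round_robin_assign(consumers, partitions_per_topic):
    """Deal every partition, in (topic, partition) order, to consumers in turn."""
    assignment = {c: [] for c in sorted(consumers)}
    members = cycle(sorted(consumers))
    for topic in sorted(partitions_per_topic):
        for p in range(partitions_per_topic[topic]):
            assignment[next(members)].append(f"{topic}p{p}")
    return assignment


def topic_round_robin_assign(consumers, partitions_per_topic):
    """Deal whole topics to consumers in turn; a topic is never split."""
    assignment = {c: [] for c in sorted(consumers)}
    members = cycle(sorted(consumers))
    for topic in sorted(partitions_per_topic):
        owner = next(members)
        assignment[owner].extend(
            f"{topic}p{p}" for p in range(partitions_per_topic[topic])
        )
    return assignment


if __name__ == "__main__":
    topics = {"t0": 3, "t1": 3}
    print(round_robin_assign(["C0", "C1"], topics))
    print(topic_round_robin_assign(["C0", "C1"], topics))
```

   With 2 consumers and 2 topics of 3 partitions each, the first function 
splits both topics across `C0` and `C1`, while the second gives all of `t0` to 
`C0` and all of `t1` to `C1`.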
   
   --------------------------------------
   
   **Second requirement: topics should not diverge too much (keep data 
consistency)**
   
   Again, if I use one of the four existing assignors, each consumer always 
gets a mix of partitions (i.e. partitions of different topics).
   If one of the consumers fails, I keep in memory which partitions failed, in 
order not to propagate the "corrupted" record.
   Otherwise, after rebalancing, when the "corrupted" record is consumed by 
another consumer, that consumer will be stopped too, and so on...
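   
   As an illustration of this "keep in memory" idea, here is a minimal Python 
sketch. It is an assumption about how such a guard could look, not code from 
my application or from Kafka: the class `FailedPartitions` and the function 
`process` are hypothetical names, records are modelled as plain 
`(topic, partition, value)` tuples, and stopping the failed consumer itself is 
left out.

```python
import threading


class FailedPartitions:
    """Thread-safe, in-memory set of partitions that must no longer be consumed."""

    def __init__(self):
        self._lock = threading.Lock()
        self._failed = set()

    def mark_failed(self, topic, partition):
        with self._lock:
            self._failed.add((topic, partition))

    def is_failed(self, topic, partition):
        with self._lock:
            return (topic, partition) in self._failed


def process(records, failed, handler):
    """Handle records in order, skipping partitions already marked as failed.

    A partition is marked as failed the first time its handler raises.
    """
    for topic, partition, value in records:
        if failed.is_failed(topic, partition):
            continue  # do not propagate the "corrupted" record again
        try:
            handler(value)
        except Exception:
            failed.mark_failed(topic, partition)
```

   Because the set is shared between the consumer threads, a consumer that 
inherits a failed partition after rebalancing skips it instead of failing on 
the same record.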
   
   Let me take the same example.
   Suppose there are 2 consumers `C0` and `C1`, and two topics `t0` and `t1`, 
each with 3 partitions, resulting in partitions `t0p0`, `t0p1`, `t0p2`, 
`t1p0`, `t1p1` and `t1p2`.
   
   If I use the standard `RoundRobinAssignor`, the assignment will be:
   `C0: [t0p0, t0p2, t1p1]`
   `C1: [t0p1, t1p0, t1p2]`
   
   If `C0` fails, it will be stopped and the new assignment will be:
   `C1: [t0p1, t1p0, t1p2, t0p0, t0p2, t1p1]`
   
   As I keep in memory that I do not want to consume `t0p0`, `t0p2`, `t1p1`, 
only records from `t0p1`, `t1p0`, `t1p2` will continue to be consumed.
   Therefore, if I do not quickly fix the issue with the "corrupted" record, 
the data of each topic will diverge (some records will be consumed, others 
not).
   
   If I use the `TopicRoundRobinAssignor`, the assignment will be:
   `C0: [t0p0, t0p1, t0p2]`
   `C1: [t1p0, t1p1, t1p2]`
   
   If `C0` fails, it will be stopped and the new assignment will be:
   `C1: [t0p0, t0p1, t0p2, t1p0, t1p1, t1p2]`
   
   As I keep in memory that I do not want to consume `t0p0`, `t0p1`, `t0p2`, 
only records from `t1p0`, `t1p1`, `t1p2` will continue to be consumed.
   Therefore, I continue to consume all records of the same topic and my data 
stays as consistent as possible.
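   
   The difference between the two failure scenarios can be checked with a 
small Python sketch. It is a simplified model under my own assumptions: the 
function name `consumed_after_failure` is hypothetical, partitions are the 
strings used in the example above, and every partition owned by the failed 
consumer is treated as blocked.

```python
def consumed_after_failure(assignment, failed_consumer):
    """Return, per topic, the partitions still consumed once a consumer failed.

    Partitions owned by the failed consumer are blocked, whoever inherits
    them after rebalancing.
    """
    blocked = set(assignment[failed_consumer])
    still_consumed = {}
    for partitions in assignment.values():
        for tp in partitions:
            topic = tp.rpartition("p")[0]  # "t0p2" -> "t0"
            still_consumed.setdefault(topic, [])
            if tp not in blocked:
                still_consumed[topic].append(tp)
    return still_consumed


if __name__ == "__main__":
    round_robin = {"C0": ["t0p0", "t0p2", "t1p1"],
                   "C1": ["t0p1", "t1p0", "t1p2"]}
    topic_round_robin = {"C0": ["t0p0", "t0p1", "t0p2"],
                         "C1": ["t1p0", "t1p1", "t1p2"]}
    print(consumed_after_failure(round_robin, "C0"))
    print(consumed_after_failure(topic_round_robin, "C0"))
```

   With the round-robin assignment, both `t0` and `t1` end up partially 
consumed; with the topic-based assignment, `t0` is fully stopped and `t1` is 
fully consumed.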
   
   ------------------------------------------------
   
   I hope this is clearer; it is not really simple to explain... Maybe my 
requirements are too specific and nobody else has these needs...
   I remain of course at your disposal for any questions or further 
explanation.


