mathieu-amblard commented on PR #12705: URL: https://github.com/apache/kafka/pull/12705#issuecomment-1263687207
@dajac Thanks for your reply, and sorry if my description was unclear. I forgot to mention something very important: I would like to be thread safe as much as possible.

I would also like to have multiple consumers (one thread per consumer) for performance reasons. For business reasons, each topic contains only one type of object. For most of them, the partitions should not diverge too much, in order to keep good data consistency (i.e. we should not continue to consume one partition of a topic if another partition of that topic failed).

Combining these hypotheses and requirements, I thought that having as many consumers as topics is the best option, provided that all partitions of a topic are assigned to only one consumer.

-------------------------------------

**First requirement: be thread safe as much as possible**

If I use one of the four existing assignors, each consumer always gets a mix of partitions from different topics, so records of the same topic can be processed concurrently.

Let me take an example. Suppose there are 2 consumers `C0` and `C1`, and two topics `t0` and `t1`, each with 3 partitions, resulting in partitions `t0p0`, `t0p1`, `t0p2`, `t1p0`, `t1p1` and `t1p2`.

If I use the standard `RoundRobinAssignor`, the assignment will be:

`C0: [t0p0, t0p2, t1p1]`
`C1: [t0p1, t1p0, t1p2]`

In that case, I must ensure that the processing executed when I receive records from `t0` and `t1` is thread safe, because the records of a given topic can come from two different threads, `C0` and `C1`, in parallel.

If I use the `TopicRoundRobinAssignor`, the assignment will be:

`C0: [t0p0, t0p1, t0p2]`
`C1: [t1p0, t1p1, t1p2]`

In that case, all records of the same topic are processed in the same thread, so there is less risk of concurrency issues.

--------------------------------------

**Second requirement: topics should not diverge too much (keep data consistency)**

Again, if I use one of the four existing assignors, each consumer always gets a mix of partitions from different topics.

If one of the consumers fails, I keep in memory which partitions failed, so that the "corrupted" record is not propagated. Otherwise, after rebalancing, the "corrupted" record would be consumed by another consumer, which would stop as well, and so on.

Let me take the same example (2 consumers `C0` and `C1`, and two topics `t0` and `t1` with 3 partitions each).

If I use the standard `RoundRobinAssignor`, the assignment will be:

`C0: [t0p0, t0p2, t1p1]`
`C1: [t0p1, t1p0, t1p2]`

If `C0` fails, it will be stopped and the new assignment will be:

`C1: [t0p1, t1p0, t1p2, t0p0, t0p2, t1p1]`

As I keep in memory that I do not want to consume `t0p0`, `t0p2` and `t1p1`, only records from `t0p1`, `t1p0` and `t1p2` will continue to be consumed. Therefore, if I do not quickly fix the issue with the "corrupted" record, the data of each topic will diverge (some records will be consumed, others not).

If I use the `TopicRoundRobinAssignor`, the assignment will be:

`C0: [t0p0, t0p1, t0p2]`
`C1: [t1p0, t1p1, t1p2]`

If `C0` fails, it will be stopped and the new assignment will be:

`C1: [t0p0, t0p1, t0p2, t1p0, t1p1, t1p2]`

As I keep in memory that I do not want to consume `t0p0`, `t0p1` and `t0p2`, only records from `t1p0`, `t1p1` and `t1p2` will continue to be consumed. Therefore, I either consume all records of a topic or none of them, and my data stays as consistent as possible.

------------------------------------------------

I hope this is clearer; it is not really simple to explain... Maybe my requirements are too specific and nobody else has this need...

I remain of course at your disposal for any questions or further explanation.
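
For illustration, the two scenarios described in this comment can be reproduced with a small standalone sketch. It deliberately does not use any Kafka class: `AssignmentSketch`, `roundRobin`, `topicRoundRobin` and `stillConsumed` are hypothetical names, and the topic-wise logic is only an assumption about the behaviour described here (each topic goes entirely to one consumer, topics being handed out in round-robin order), not the actual code of this PR.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Standalone sketch: plain collections only, no Kafka classes involved.
final class AssignmentSketch {

    // Builds partition names such as "t0p0", topic by topic.
    static List<String> partitions(List<String> topics, int partitionsPerTopic) {
        List<String> result = new ArrayList<>();
        for (String topic : topics) {
            for (int p = 0; p < partitionsPerTopic; p++) {
                result.add(topic + "p" + p);
            }
        }
        return result;
    }

    static Map<String, List<String>> emptyAssignment(List<String> consumers) {
        Map<String, List<String>> assignment = new LinkedHashMap<>();
        for (String consumer : consumers) {
            assignment.put(consumer, new ArrayList<>());
        }
        return assignment;
    }

    // Partition-wise round robin: partition i goes to consumer i % n.
    static Map<String, List<String>> roundRobin(
            List<String> consumers, List<String> topics, int partitionsPerTopic) {
        Map<String, List<String>> assignment = emptyAssignment(consumers);
        List<String> all = partitions(topics, partitionsPerTopic);
        for (int i = 0; i < all.size(); i++) {
            assignment.get(consumers.get(i % consumers.size())).add(all.get(i));
        }
        return assignment;
    }

    // Topic-wise round robin (assumed behaviour): topic i goes entirely to consumer i % n.
    static Map<String, List<String>> topicRoundRobin(
            List<String> consumers, List<String> topics, int partitionsPerTopic) {
        Map<String, List<String>> assignment = emptyAssignment(consumers);
        for (int i = 0; i < topics.size(); i++) {
            String owner = consumers.get(i % consumers.size());
            assignment.get(owner).addAll(partitions(List.of(topics.get(i)), partitionsPerTopic));
        }
        return assignment;
    }

    // Partitions still consumed once `failed` stops and its partitions are remembered as blocked.
    static List<String> stillConsumed(Map<String, List<String>> assignment, String failed) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : assignment.entrySet()) {
            if (!entry.getKey().equals(failed)) {
                result.addAll(entry.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> consumers = List.of("C0", "C1");
        List<String> topics = List.of("t0", "t1");

        Map<String, List<String>> rr = roundRobin(consumers, topics, 3);
        Map<String, List<String>> topicRr = topicRoundRobin(consumers, topics, 3);

        System.out.println(rr);                         // {C0=[t0p0, t0p2, t1p1], C1=[t0p1, t1p0, t1p2]}
        System.out.println(topicRr);                    // {C0=[t0p0, t0p1, t0p2], C1=[t1p0, t1p1, t1p2]}
        System.out.println(stillConsumed(rr, "C0"));      // [t0p1, t1p0, t1p2] -> both topics partially consumed
        System.out.println(stillConsumed(topicRr, "C0")); // [t1p0, t1p1, t1p2] -> t1 fully consumed, t0 fully stopped
    }
}
```

With the partition-wise assignment, a failure of `C0` leaves both topics partially consumed; with the topic-wise assignment, `t1` keeps being consumed entirely while `t0` stops entirely.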
