[GitHub] [druid] abhishekagarwal87 commented on pull request #14424: Add support to read from multiple kafka topics in same supervisor

via GitHub Tue, 27 Jun 2023 00:56:25 -0700


abhishekagarwal87 commented on PR #14424:
URL: https://github.com/apache/druid/pull/14424#issuecomment-1608986420


   > This is neat and would help with operational simplicity. Few thoughts and 
questions:
   > 
   > 1. Will multi-stream processing work if the `ioConfig` is different for 
the topics? For example, the input topics span distinct brokers because 
different teams own them, the shape of data on the streams is different, or 
they have security settings configured differently.
   The topics have to belong to the same cluster. The use cases that I know of 
so far do not require reading from topics in different clusters. The shape of 
the data can be different, just as it can be different across multiple 
partitions within a topic. This patch doesn't support a topic-specific tuning 
config, and we can't have that anyway: a task doesn't really care what topic 
it is reading from. It just reads from a set of partitions, and those 
partitions can belong to the same topic or to different topics.
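
   To illustrate the point that a task only sees a set of partitions, here is 
a minimal sketch. It is not Druid's actual assignment logic: the 
`TopicPartition` tuple, the `assign_partitions` helper, the round-robin 
strategy, and the topic names are all assumptions made up for this example.

```python
from collections import namedtuple

# Hypothetical stand-in for a Kafka topic-partition; not a Druid class.
TopicPartition = namedtuple("TopicPartition", ["topic", "partition"])


def assign_partitions(partitions, task_count):
    """Spread topic-partitions over tasks round-robin (illustrative only)."""
    groups = [[] for _ in range(task_count)]
    for index, tp in enumerate(sorted(partitions)):
        groups[index % task_count].append(tp)
    return groups


# Partitions from two made-up topics in the same cluster.
partitions = [
    TopicPartition("ingestion-metrics", 0),
    TopicPartition("ingestion-metrics", 1),
    TopicPartition("query-metrics", 0),
    TopicPartition("query-metrics", 1),
    TopicPartition("query-metrics", 2),
]

groups = assign_partitions(partitions, task_count=2)
for task_id, group in enumerate(groups):
    # Each task just receives a list of partitions; the topic is only a label.
    print(task_id, group)
```

   In this sketch a single task can end up with partitions from both topics, 
which is why a per-topic tuning config would have nothing to attach to.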
   >    To that effect, have we considered an array of `ioConfig` instead of a 
comma-separated list of topics that share a single `ioConfig`? I don't know how 
backwards compatibility would work in this case, though.
   >    EDIT: It seems like [@kfaraz](https://github.com/kfaraz) asked 
something similar. Is it reasonable to say that the scope of this change is to 
support multi-stream processing only if the topics share the same properties -- 
i.e., `ioConfig`, `dataSchema`, etc. are the same? For all other use cases, I 
wonder what effort it would take and if the design is extensible in the 
future.
   There is no reason for `dataSchema` to be uniform. Imagine that the 
ingestion system is emitting some metrics to one topic and the query system is 
emitting some metrics to another topic. These topics will likely have 
different columns. If you have decided to put this data in the same 
datasource, you have already assumed that you are OK with a merged schema.
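
   As a rough illustration of what a merged schema means here, the sketch 
below takes rows from two made-up topics and computes the union of their 
columns. The column names and the `merged_columns` and `normalize` helpers are 
hypothetical; how Druid actually resolves columns depends on the `dataSchema` 
of the supervisor.

```python
def merged_columns(rows):
    """Return the union of column names, in order of first appearance."""
    columns = []
    for row in rows:
        for name in row:
            if name not in columns:
                columns.append(name)
    return columns


def normalize(rows):
    """Give every row the full merged column set, using None for gaps."""
    columns = merged_columns(rows)
    return [{name: row.get(name) for name in columns} for row in rows]


# One row from each hypothetical topic, with different metric columns.
rows = [
    {"timestamp": "2023-06-27T00:00:00Z", "service": "ingestion", "rows_ingested": 100},
    {"timestamp": "2023-06-27T00:00:01Z", "service": "query", "query_time_ms": 42},
]

normalized = normalize(rows)
for row in normalized:
    print(row)
```

   Each row ends up with a null for the columns that only the other topic 
emits, which is the trade-off accepted by writing both topics to one 
datasource.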
   > 2. What are the implications on kafka ingestion metrics and supervisor 
operations like reset offsets in the multi-stream supervisor mode?
   The value of the partition-id dimension will change, but otherwise they 
are expected to work as they do today.
   > 3. Fwiw, Kinesis also supports [multi-streaming 
processing](https://aws.amazon.com/about-aws/whats-new/2020/10/kinesis-client-library-enables-multi-stream-processing/)
 by a single consumer starting `2.3.1`. Once we bump up the current kinesis 
client SDK version from `1.14.4` to the latest stable, we can add similar 
support for kinesis indexing as well :)
   Nice. I will look at that. 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

