abhishekagarwal87 commented on PR #14424: URL: https://github.com/apache/druid/pull/14424#issuecomment-1608986420
> This is neat and would help with operational simplicity. Few thoughts and questions:
>
> 1. Will multi-stream processing work if the `ioConfig` is different for the topics? For example, the input topics span distinct brokers because different teams own them, the shape of data on the streams is different, or they have security settings configured differently.

The topics have to belong to the same cluster. The use cases that I know of so far do not require reading from topics in different clusters. The shape of the data can be different, just as it can be different across multiple partitions within a topic. This patch doesn't support a topic-specific tuning config. We can't have that anyway. A task doesn't really care what topic it is reading from. It just reads from a set of partitions. Those partitions can belong to the same topic or to different topics.

> To that effect, have we considered an array of `ioConfig` instead of a comma-separated list of topics that share a single `ioConfig`? I don't know how backwards compatibility would work in this case, though.
>
> EDIT: It seems like [@kfaraz](https://github.com/kfaraz) asked something similar. Is it reasonable to say that the scope of this change is to support multi-stream processing only if the topics share the same properties -- i.e., `ioConfig`, `dataSchema`, etc. are the same? For all other use cases, I wonder what effort it would take and whether the design is extensible in the future.

There is no reason for `dataSchema` to be uniform. Imagine that the ingestion system is emitting some metrics to one topic and the query system is emitting some metrics to another topic. These topics will likely have different columns. If you have decided to put this data in the same datasource, you have already accepted a merged schema.

> 2. What are the implications on kafka ingestion metrics and supervisor operations like reset offsets in the multi-stream supervisor mode?

The partition-id dimension value will change, but otherwise they are expected to work as is.

> 3. Fwiw, Kinesis also supports [multi-streaming processing](https://aws.amazon.com/about-aws/whats-new/2020/10/kinesis-client-library-enables-multi-stream-processing/) by a single consumer starting `2.3.1`. Once we bump up the current kinesis client SDK version from `1.14.4` to the latest stable, we can add similar support for kinesis indexing as well :)

Nice. I will look at that.
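
To illustrate the point that a task only reads from a set of partitions and does not care which topic each one belongs to, here is a minimal sketch. It is not Druid's actual assignment logic: the `TopicPartition` type, the `assign_partitions` helper, the round-robin strategy, and the topic names are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class TopicPartition(NamedTuple):
    """Hypothetical identifier: a partition is addressed by topic name plus partition number."""
    topic: str
    partition: int


def assign_partitions(partitions, task_count):
    """Spread partitions round-robin over task_count tasks, ignoring the source topic."""
    groups = defaultdict(list)
    for index, tp in enumerate(sorted(partitions)):
        groups[index % task_count].append(tp)
    return dict(groups)


# Two hypothetical topics in the same cluster, with different partition counts.
partitions = [TopicPartition("ingestion-metrics", p) for p in range(3)]
partitions += [TopicPartition("query-metrics", p) for p in range(2)]

assignment = assign_partitions(partitions, task_count=2)
for task_id, assigned in sorted(assignment.items()):
    print(task_id, [f"{tp.topic}:{tp.partition}" for tp in assigned])
```

In this sketch a single task can end up with partitions from both topics, which is why a topic-specific tuning config would have no natural place to apply.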
