[I] Multiple upstream source ingestion support on Pinot [pinot]

via GitHub Thu, 08 Aug 2024 11:54:37 -0700


lnbest0707-uber opened a new issue, #13780:
URL: https://github.com/apache/pinot/issues/13780


   Pinot currently supports realtime tables that ingest from only a single 
source stream, e.g. one Kafka topic in one Kafka cluster. Inside the table 
manager, the internal segment partition concept is tightly coupled with the 
stream's partition. For example, if a Kafka topic has 8 partitions, then the 
Pinot table's segments are also split into 8 partitions, and each segment 
consumes from the Kafka topic partition with the exact same partition id.
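
   The coupling described above amounts to an identity mapping between segment 
partition ids and stream partition ids. A minimal sketch of that idea follows; 
the function name is hypothetical and is not part of Pinot's code base:

```python
from typing import Dict


def coupled_assignment(num_stream_partitions: int) -> Dict[int, int]:
    """Illustrative only: each segment partition id maps to the stream
    partition with the same id, which is the 1:1 coupling described above."""
    return {pid: pid for pid in range(num_stream_partitions)}


# A topic with 8 partitions yields 8 segment partitions with matching ids.
mapping = coupled_assignment(8)
```

   With such a mapping there is no room for a second topic: its partition ids 
would collide with the ids already used by the first one.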
   This is a simple, workable design that fits most straightforward use cases, 
but it also limits ingestion flexibility.
   In practice, users may produce data on the same subject to different Kafka 
topics and want to ingest it into a single Pinot table (with the same schema) 
for centralized analysis. An open Pinot issue already asks for this feature: 
https://github.com/apache/pinot/issues/5647. Other OLAP technologies, e.g. 
ClickHouse and Druid, are developing or have developed similar features, such as 
https://clickhouse.com/docs/en/engines/table-engines/integrations/kafka and 
https://github.com/apache/druid/pull/14424.
   Based on the current Pinot architecture, it is possible to add the feature 
with the following capabilities and constraints:
   
   - Ingests from **multiple stream topics** and formats into the same Pinot 
table.
   - Different stream topics may have **different numbers of partitions** and 
even different data formats (JSON, Avro, Protobuf, etc.), meaning the Pinot 
table should be able to use a different decoder for the data from each topic.
   - The same transformation and indexing strategy is applied to the decoded 
data from all topics. This limitation comes from the current TableConfig 
structure and could be lifted by a major TableConfig refactor. Even with this 
limitation, transformation can still be done easily using existing dynamic 
transformation features such as the SchemaConformingTransformer introduced in 
https://github.com/apache/pinot/pull/12788.
   - Starts with **LLC** (low-level consumer) tables.
   - Table schema evolution, **stream partition count expansion and auto 
catch-up**, and instance assignment strategies must keep the same support 
without regressions.
   - In the short term, adding or removing topics from the stream topic list 
is out of scope.
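
   The per-topic decoder and shared transformation constraints above can be 
sketched roughly as follows. This is an illustration under stated assumptions, 
not Pinot's API: the topic names, the toy `k=v` format (standing in for Avro or 
Protobuf), and all function names are hypothetical.

```python
import json
from typing import Callable, Dict

Row = Dict[str, object]


def decode_json(payload: bytes) -> Row:
    """Decoder for a topic that carries JSON messages."""
    return json.loads(payload.decode("utf-8"))


def decode_kv(payload: bytes) -> Row:
    """Decoder for a toy 'k=v;k=v' format, a stand-in for another encoding."""
    items = [item for item in payload.decode("utf-8").split(";") if item]
    return dict(item.split("=", 1) for item in items)


# Hypothetical registry: each topic has its own decoder.
DECODERS: Dict[str, Callable[[bytes], Row]] = {
    "orders_json": decode_json,
    "orders_kv": decode_kv,
}


def shared_transform(row: Row) -> Row:
    """One table-level transformation, applied regardless of source topic."""
    return {"order_id": str(row["id"]), "amount": float(row["amount"])}


def ingest(topic: str, payload: bytes) -> Row:
    """Decode with the topic-specific decoder, then apply the shared transform."""
    return shared_transform(DECODERS[topic](payload))
```

   Both topics end up producing rows with the same schema, which is what lets 
a single table-level indexing configuration apply to all of them.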
   
   The implementation strategy should consider **decoupling the partition 
concept between stream and Pinot**. In principle, the stream and the OLAP 
database are two independent pieces of infrastructure and storage. Each should 
have its own partitioning strategy instead of depending directly on the other's. 
The Pinot segment partition is only directly used for segment management, so the 
data consumed by each segment partition should not be tightly coupled to the 
stream's partition. An abstraction layer could be built in between to manage the 
mapping.
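
   One possible shape for such a mapping layer is sketched below. It assumes a 
fixed-stride encoding in which every topic gets its own block of Pinot 
partition ids; the stride value and all names are hypothetical and are not 
taken from this issue or from Pinot's code base.

```python
from typing import Dict, List, NamedTuple

# Assumed upper bound on partitions per topic (hypothetical value).
STRIDE = 10_000


class StreamPartition(NamedTuple):
    topic_index: int          # position of the topic in the table's topic list
    stream_partition_id: int  # partition id inside that topic


def to_pinot_partition(sp: StreamPartition) -> int:
    """Encode (topic, stream partition) into one Pinot-side partition id."""
    if not 0 <= sp.stream_partition_id < STRIDE:
        raise ValueError("stream partition id outside the assumed stride")
    return sp.topic_index * STRIDE + sp.stream_partition_id


def to_stream_partition(pinot_partition_id: int) -> StreamPartition:
    """Decode a Pinot-side partition id back into (topic, stream partition)."""
    return StreamPartition(*divmod(pinot_partition_id, STRIDE))


def build_mapping(partitions_per_topic: List[int]) -> Dict[int, StreamPartition]:
    """Build the full mapping for topics with differing partition counts."""
    mapping: Dict[int, StreamPartition] = {}
    for topic_index, count in enumerate(partitions_per_topic):
        for pid in range(count):
            sp = StreamPartition(topic_index, pid)
            mapping[to_pinot_partition(sp)] = sp
    return mapping
```

   Under this assumption, a topic that grows from 8 to 12 partitions only adds 
ids inside its own block, so existing Pinot partition ids stay stable.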
   This feature could also improve ingestion performance and address issues 
like https://github.com/apache/pinot/issues/13319 by allowing multiple segment 
partitions to consume from the same topic partition.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

