GitHub user lhotari added a comment to the discussion: Thousands of cluster 
geo-replication for fan-in aggregation ?

> Does pulsar support such a model ?

Yes. However, "support" is perhaps not the right word here, since having 10K 
clusters within geo-replication would be a very extreme use case.

> What are the scalability concerns to be worried about ?

There would be a lot of traffic amplification at the target cluster "cAgg". 
Throughput matters a lot here, and the concern is how to scale the target. 
From a design perspective, this model isn't scalable when thousands of 
partitions all aggregate into a single partition.
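
To make the amplification concrete, here is a rough back-of-the-envelope 
sketch. The function name and the numbers (10,000 source clusters at 50 msg/s 
each) are assumptions for illustration only; the actual volumes weren't 
provided in the discussion.

```python
def fan_in_load(num_source_clusters: int, msgs_per_sec_per_cluster: float) -> float:
    """Inbound rate at a single target when every source replicates into it."""
    return num_source_clusters * msgs_per_sec_per_cluster


# Hypothetical volumes, not taken from the discussion.
inbound = fan_in_load(10_000, 50.0)
print(f"{inbound:,.0f} msg/s")  # prints "500,000 msg/s"
```

Even a modest per-cluster rate multiplies into a large inbound rate, and in 
this model all of it lands on a single partition of the target topic.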

> Any impact on the topic-stats api or admin-api as it lists all replications ?

That's probably not a major concern. However, it could be unmanageable with 
thousands of replications.

> Any impact on the geo-config-store ?

I probably wouldn't use a global configuration store at all in such 
configurations.

> Any other considerations for implementing this model ?

I don't have the context of what the use case is and what the volumes are. 
Based on the provided information, I'd put more focus on why the aggregation is 
needed and how to find a scalable design for aggregation. 

Perhaps the aggregation is a stream processing problem and could be handled 
with multiple levels of aggregation, implemented with Flink and its Pulsar 
connector? 
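
As a rough illustration of why multiple levels help, the sketch below 
computes how many aggregator nodes each level would need if every node 
accepts at most a fixed number of inputs. This is plain Python rather than 
Flink code, and `stage_sizes` and the fan-in limit of 100 are assumptions 
made up for this example.

```python
def stage_sizes(num_sources: int, max_fan_in: int) -> list[int]:
    """Number of aggregator nodes per level, from the first level to the root."""
    if max_fan_in < 2:
        raise ValueError("max_fan_in must be at least 2")
    sizes = []
    remaining = num_sources
    while remaining > 1:
        # Ceiling division: each node takes up to max_fan_in inputs.
        remaining = -(-remaining // max_fan_in)
        sizes.append(remaining)
    return sizes


# Hypothetical: 10,000 sources, each aggregator handles at most 100 inputs.
print(stage_sizes(10_000, 100))  # prints [100, 1]
```

With these assumed numbers, two levels are enough, and no single node has to 
absorb more than 100 inbound streams.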

If aggregation using geo-replication is really necessary, I'd recommend a 
sharded design with multiple aggregation clusters, where the per-shard 
results are then combined in a final aggregation step, possibly using a 
stream processing solution. 
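
A minimal sketch of how source clusters could be assigned to aggregation 
shards, assuming a stable hash of the cluster name. The cluster names 
(`edge-00000`, `cAgg-0`, ...) and the helper functions are hypothetical; this 
only shows the assignment logic and is not a Pulsar API.

```python
import hashlib


def shard_for(cluster_name: str, num_shards: int) -> int:
    """Map a cluster name to a shard index using a stable (non-random) hash."""
    digest = hashlib.sha256(cluster_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def assign(source_clusters: list[str], agg_clusters: list[str]) -> dict[str, list[str]]:
    """Group source clusters by the aggregation cluster they would replicate to."""
    assignment: dict[str, list[str]] = {agg: [] for agg in agg_clusters}
    for name in source_clusters:
        target = agg_clusters[shard_for(name, len(agg_clusters))]
        assignment[target].append(name)
    return assignment


# Hypothetical names: 10,000 edge clusters spread over 16 aggregation clusters.
sources = [f"edge-{i:05d}" for i in range(10_000)]
aggregators = [f"cAgg-{i}" for i in range(16)]
plan = assign(sources, aggregators)
for agg, members in plan.items():
    print(agg, len(members))
```

Each source cluster would then enable replication only towards its assigned 
aggregation cluster. Note that a plain modulo assignment reshuffles most 
clusters when the number of shards changes, so the shard count would have to 
be chosen up front or replaced with something like consistent hashing.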

There are also other types of solutions for aggregation that are compatible 
with Pulsar. For example, StreamNative has announced a "Streaming Lakehouse" 
product, "Lakehouse Tiered Storage for Pulsar". More details are in the 
[video](https://streamnative.io/videos/streaming-data-into-your-lakehouse-introducing-pulsars-lakehouse-tiered-storage) 
and the 
[blog post](https://streamnative.io/blog/streaming-lakehouse-introducing-pulsars-lakehouse-tiered-storage). 
This opens up completely new possibilities for aggregating the results and 
saving on costs.

GitHub link: 
https://github.com/apache/pulsar/discussions/22438#discussioncomment-9017514
