zhangshenghang commented on issue #10666:
URL: https://github.com/apache/seatunnel/issues/10666#issuecomment-4161328289
> > I have a few questions:
> >
> > 1. Does Sink need to be supported?
> > 2. Does edge collection need to be planned to support a cluster mode? For example, if the data source for edge collection is very large and a single node cannot support synchronization, collection needs to be performed in a cluster mode on edge machines. Different approaches may lead to different design solutions and varying levels of complexity.
>
> Good question, these two questions are very important.
>
> My current thinking is:
>
> 1. Sink should not be in the first scope.
>
> For this issue, I suggest we focus on the **source / collection side only** first.
>
> The reason is that a lightweight edge collector is mainly meant to solve **"data can only be accessed on remote hosts, but processing should stay centralized in Zeta"**.
>
> If we include Sink in the first design, the problem becomes much larger:
>
> * remote write-back into isolated networks
> * reverse traffic / reverse tunnel design
> * delivery semantics for sink acknowledgements
> * much more operational complexity
>
> So my suggestion is:
>
> * **V1:** edge collector for source-side ingestion only
> * **Future extension:** discuss an edge sink model separately if there is a real demand
>
> 2. Edge cluster mode should not be the MVP, but the design should leave room for it.
>
> I agree that this question matters a lot.
>
> If edge collection must support a real cluster mode in the first version, the complexity will increase significantly, because then we need to think about:
>
> * edge-side coordination
> * partition assignment and rebalance
> * failover between edge nodes
> * edge-side state / checkpoint ownership
> * service discovery between edge nodes
>
> Because of that, my preference would be:
>
> * **V1:** single-agent model, or multiple independent agents without an edge-cluster control plane
> * **V2+:** if needed, add formal edge-cluster support
>
> In other words, for large sources, the first practical step could be to allow **multiple independent edge agents** to send different partitions / directories / topics / shards into Zeta, without introducing a separate edge-cluster scheduler in the first version.
>
> That keeps the first design much simpler, while still leaving room for future evolution.
>
> So overall, my current preference is:
>
> * **Scope of this issue:** source-side edge collection
> * **MVP:** lightweight agent + central Zeta ingress
> * **Not in MVP:** edge sink + full edge-cluster control plane
>
> If this scope sounds reasonable, I can also add a follow-up comment to outline a possible MVP boundary more concretely.

This is an excellent feature, and I look forward to its implementation.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
