Re: [I] [Feature][Zeta] Support lightweight edge collector clients for remote data collection [seatunnel]

via GitHub Tue, 31 Mar 2026 02:49:12 -0700


zhangshenghang commented on issue #10666:
URL: https://github.com/apache/seatunnel/issues/10666#issuecomment-4161328289


   > > I have a few questions:
   > > 
   > > 1. Does Sink need to be supported?
   > > 2. Does edge collection need to be planned to support a cluster mode? For example, if the data source for edge collection is very large and a single node cannot support synchronization, collection needs to be performed in a cluster mode on edge machines. Different approaches may lead to different design solutions and varying levels of complexity.
   > 
   > Good question, these two questions are very important.
   > 
   > My current thinking is:
   > 
   > 1. Sink should not be in the first scope.
   > 
   > For this issue, I suggest we focus on the **source / collection side only** first.
   > 
   > The reason is that a lightweight edge collector is mainly meant to solve **"data can only be accessed on remote hosts, but processing should stay centralized in Zeta"**.
   > 
   > If we include Sink in the first design, the problem becomes much larger:
   > 
   > * remote write-back into isolated networks
   > * reverse traffic / reverse tunnel design
   > * delivery semantics for sink acknowledgements
   > * much more operational complexity
   > 
   > So my suggestion is:
   > 
   > * **V1:** edge collector for source-side ingestion only
   > * **Future extension:** discuss an edge sink model separately if there is a real demand
   > 
   > 2. Edge cluster mode should not be the MVP, but the design should leave room for it.
   > 
   > I agree that this question matters a lot.
   > 
   > If edge collection must support a real cluster mode in the first version, the complexity will increase significantly, because then we need to think about:
   > 
   > * edge-side coordination
   > * partition assignment and rebalance
   > * failover between edge nodes
   > * edge-side state / checkpoint ownership
   > * service discovery between edge nodes
   > 
   > Because of that, my preference would be:
   > 
   > * **V1:** single-agent model, or multiple independent agents without an edge-cluster control plane
   > * **V2+:** if needed, add formal edge-cluster support
   > 
   > In other words, for large sources, the first practical step could be to allow **multiple independent edge agents** to send different partitions / directories / topics / shards into Zeta, without introducing a separate edge-cluster scheduler in the first version.
   > 
   > That keeps the first design much simpler, while still leaving room for future evolution.
   > 
   > So overall, my current preference is:
   > 
   > * **Scope of this issue:** source-side edge collection
   > * **MVP:** lightweight agent + central Zeta ingress
   > * **Not in MVP:** edge sink + full edge-cluster control plane
   > 
   > If this scope sounds reasonable, I can also add a follow-up comment to outline a possible MVP boundary more concretely.
   
   This is an excellent feature, and I look forward to its implementation.
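
To make the quoted V1 idea of "multiple independent edge agents" a bit more concrete, here is a minimal sketch of how agents could split partitions / topics / shards among themselves with no edge-side coordinator. Everything in it is an assumption for illustration only: `owner_index`, `shards_for_agent`, and the `agent_index` / `agent_count` settings are hypothetical names, not part of SeaTunnel or Zeta. Each agent is assumed to be started with a static index and the total agent count, and it derives its own share of the shards locally.

```python
import hashlib
from typing import Iterable, List


def owner_index(shard: str, agent_count: int) -> int:
    """Map a shard name to an agent index.

    Uses SHA-256 rather than the built-in hash(), because hash() of a str
    is randomized per process and agents must all compute the same answer.
    """
    digest = hashlib.sha256(shard.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % agent_count


def shards_for_agent(shards: Iterable[str], agent_index: int, agent_count: int) -> List[str]:
    """Return the shards this agent should collect (hypothetical helper).

    Assumption: agent_index and agent_count come from static per-agent
    configuration; there is no rebalance and no failover in this model.
    """
    if agent_count <= 0 or not 0 <= agent_index < agent_count:
        raise ValueError("agent_index must be in [0, agent_count)")
    return sorted(s for s in shards if owner_index(s, agent_count) == agent_index)


if __name__ == "__main__":
    all_shards = [f"topic-{i}" for i in range(8)]
    for idx in range(3):
        print(idx, shards_for_agent(all_shards, idx, 3))
```

Because assignment is a pure function of the shard name and the static configuration, every shard is owned by exactly one agent without any service discovery. The trade-off is that changing `agent_count` reshuffles ownership and a failed agent's shards are simply not collected, which matches the quoted proposal of deferring rebalance and failover to a later edge-cluster design.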


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
