davidzollo opened a new issue, #10666:
URL: https://github.com/apache/seatunnel/issues/10666

   # GitHub Issue Draft
   
   ## Repo
   
   `apache/seatunnel`
   
   ## Proposed Title
   
   `[Feature][Zeta] Support lightweight edge collector clients for remote data 
collection`
   
   ## Proposed Body
   
   ```md
   ### Search before asking
   
   I searched existing feature requests and did not find a similar proposal for 
a lightweight edge collector / remote agent model for SeaTunnel Zeta.
   
   ### Description
   
   I would like to start a discussion about whether SeaTunnel Zeta should 
support a lightweight edge collector client that runs on remote hosts, collects 
local data, and sends the data stream into a central Zeta cluster for transform 
and sink processing.
   
   The main goal is not to replace the current SeaTunnel job client. Instead, 
this would introduce an edge-side collection model for scenarios where the data 
source is only reachable from remote hosts, or where users want a very small 
local process for collection while keeping scheduling, transformation, 
checkpoint coordination, and sink execution centralized in Zeta.
   
   Typical examples:
   
   * collecting local files or logs from remote machines
   * collecting application events or metrics from private network hosts
   * collecting data through custom local SDKs or internal protocols that 
should not run directly inside the Zeta cluster
   
   Today, SeaTunnel already has:
   
   * an engine client for job submission
   * source connectors that run inside worker tasks
   * a socket connector that demonstrates basic network ingestion
   
   However, there is no first-class model for a lightweight remote collector 
that focuses only on collection + buffering + transport.
   
   I think this could be useful if SeaTunnel wants to support an "edge 
collection, central processing" architecture.
   
   ### Usage Scenario
   
   One example is a company that has many business hosts in isolated network 
zones. Those hosts can access local files, local applications, or internal 
services, but the central SeaTunnel Zeta cluster cannot directly access those 
sources.
   
   In that case, a lightweight collector could:
   
   * run as a small daemon or sidecar on the remote host
   * collect data locally
   * buffer and retry locally
   * securely send batches to a Zeta-side ingress endpoint
   
   Then the Zeta cluster would still be responsible for:
   
   * pipeline execution
   * transform logic
   * checkpoint and recovery coordination
   * downstream sink delivery
   
   This would be especially helpful for:
   
   * edge log collection
   * remote file ingestion
   * custom event collection
   * environments with strict network isolation
   
   ### Related issues
   
   I found the existing socket-related issue below, but it does not seem to 
cover this broader feature proposal:
   
   * #10528
   
   ### Additional discussion points
   
   If the community thinks this direction makes sense, I think the discussion 
should focus on:
   
   * whether this should be a new `agent-source` / ingress model instead of 
extending the current job client
   * which delivery guarantee the MVP should target: at-least-once or exactly-once
   * whether the first version should target logs, files, and custom event sources rather than CDC/database scenarios
   * how to keep the design compatible with the current Zeta source/checkpoint 
model
   
   I am opening this issue mainly for design discussion first.
   ```
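   To make the collect / buffer / retry loop in the proposed body more concrete, here is a minimal sketch in Java (SeaTunnel's implementation language). Everything in it is hypothetical illustration, not an existing SeaTunnel API: the `EdgeCollector` class, the drop-oldest overflow policy, and the ack-based `transport` stand-in for a Zeta-side ingress endpoint are all assumptions made for this sketch.

   ```java
   import java.util.ArrayDeque;
   import java.util.ArrayList;
   import java.util.Deque;
   import java.util.List;
   import java.util.function.Predicate;

   // Hypothetical sketch of the edge-side loop: collect locally, buffer with
   // a bound, send in batches, and re-queue un-acked batches for retry.
   // None of these names exist in SeaTunnel today.
   public class EdgeCollector {
       private final Deque<String> buffer = new ArrayDeque<>();
       private final int maxBufferSize;
       private final int batchSize;
       // Stand-in for the Zeta-side ingress endpoint: returns true when a
       // batch is acknowledged, false when delivery failed.
       private final Predicate<List<String>> transport;

       public EdgeCollector(int maxBufferSize, int batchSize,
                            Predicate<List<String>> transport) {
           this.maxBufferSize = maxBufferSize;
           this.batchSize = batchSize;
           this.transport = transport;
       }

       // Accept one local event; drop the oldest on overflow (one possible
       // policy -- blocking or spilling to local disk are alternatives).
       public void collect(String event) {
           if (buffer.size() >= maxBufferSize) {
               buffer.pollFirst();
           }
           buffer.addLast(event);
       }

       // Drain up to batchSize events and attempt delivery. An un-acked
       // batch is pushed back at the head in order, so the next flush
       // retries it: at-least-once (duplicates possible, no silent loss).
       public boolean flush() {
           List<String> batch = new ArrayList<>();
           while (batch.size() < batchSize && !buffer.isEmpty()) {
               batch.add(buffer.pollFirst());
           }
           if (batch.isEmpty() || transport.test(batch)) {
               return true;
           }
           for (int i = batch.size() - 1; i >= 0; i--) {
               buffer.addFirst(batch.get(i));
           }
           return false;
       }

       public int pending() {
           return buffer.size();
       }
   }
   ```

   A production version would of course need durable spooling, TLS on the transport, and batch IDs that the Zeta checkpoint coordinator could deduplicate if exactly-once were later pursued; the sketch only illustrates the at-least-once MVP shape under discussion.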
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
