mattcasters opened a new issue, #6963:
URL: https://github.com/apache/hop/issues/6963
### What would you like to happen?
### Proposed Overall Structure
- **New Metadata Plugin Type**: `DataStreamPluginType`
A new top-level plugin type, similar to other metadata plugin types in
Hop. It should live in the core or engine module, or as its own module under
`plugins/metadata` (or a dedicated `plugins/datastreams` category).
- **Metadata Element**: `Data Stream` (user-facing name)
- Annotated with `@HopMetadata(key = "data-stream", name = "Data Stream",
...)`
- Stored as JSON in the project’s `metadata/data-stream/` folder.
- Fully manageable and editable in the **Metadata perspective**.
- **Two New Transforms** (to be placed under `plugins/transforms`):
- **Data Stream Output** – writes rows to a selected Data Stream (producer
side)
- **Data Stream Input** – reads rows from a selected Data Stream (consumer
side)
- **Pluggable Implementations** (via the new `DataStreamPluginType`):
1. **Arrow Socket** – for same-machine or simple multi-process streaming
using Arrow IPC over TCP or Unix domain sockets.
2. **Arrow Flight** – ideal for distributed scenarios (Spark / Flink) and
network use cases; supports parallel writes from multiple executors via `DoPut`
to a central Flight server.
3. **Arrow File** – simplest decoupled option; writes Arrow IPC streams or
files (including partitioned writes to local disk, HDFS, or S3). Downstream
processes can read and optionally merge the data.
Additional implementations (ZeroMQ + Arrow, Kafka with Arrow serialization,
shared memory, etc.) can be added later without changing the core metadata or
the two transforms.
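As a rough sketch of what the pluggable contract could look like — all names here (`IDataStream`, `InMemoryDataStream`, the method signatures) are illustrative assumptions, not existing Hop APIs — the two transforms would program against a small interface, and each implementation (Arrow Socket, Arrow Flight, Arrow File, and later additions) would fulfill it. A trivial in-memory stand-in illustrates the producer/consumer flow:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical contract each Data Stream implementation would fulfill.
 * The Data Stream Output transform only calls open()/write()/close();
 * the Data Stream Input transform only calls open()/read()/close().
 */
interface IDataStream {
  void open() throws Exception;
  void write(Object[] row) throws Exception; // producer side
  Object[] read() throws Exception;          // consumer side; null = end of stream
  void close() throws Exception;
}

/**
 * Minimal in-memory stand-in, used here only to illustrate the contract;
 * a real implementation would move Arrow record batches instead of rows.
 */
class InMemoryDataStream implements IDataStream {
  private final BlockingQueue<Object[]> queue;

  InMemoryDataStream(int capacity) {
    this.queue = new ArrayBlockingQueue<>(capacity);
  }

  @Override public void open() { /* nothing to do for the in-memory case */ }

  @Override public void write(Object[] row) throws InterruptedException {
    queue.put(row); // blocks when the buffer is full (natural back-pressure)
  }

  @Override public Object[] read() throws InterruptedException {
    return queue.poll(1, TimeUnit.SECONDS); // null once no more rows arrive
  }

  @Override public void close() { queue.clear(); }
}

public class DataStreamSketch {
  public static void main(String[] args) throws Exception {
    IDataStream stream = new InMemoryDataStream(16);
    stream.open();
    stream.write(new Object[] {"id", 1});
    Object[] row = stream.read();
    System.out.println(row[0] + " -> " + row[1]); // prints "id -> 1"
    stream.close();
  }
}
```

Because the transforms only see the interface, swapping Arrow Socket for Arrow Flight (or any future ZeroMQ/Kafka variant) is purely a metadata change.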
### How It Would Work in Practice
1. **Define a Data Stream** in the Metadata perspective:
- Name: e.g. `PythonArrowStream`, `SparkCollector`,
`ExternalPythonConsumer`
- Implementation: choose Arrow Socket, Arrow Flight, or Arrow File
- Implementation-specific settings (port range, Flight endpoint, base
path, etc.)
- Common settings: batch size, direction, optional schema reference,
buffer configuration
2. **Use in Pipelines**:
- Drop a **Data Stream Output** transform and select the desired Data
Stream by name (dropdown, just like a database connection).
- Drop a **Data Stream Input** transform and select the same (or another)
Data Stream by name.
- Connections are initialized lazily when the transform first needs them.
3. **Distributed Support (Hop on Spark / Hop on Flink)**:
- Metadata is automatically serialized and distributed to all executors
(standard Hop/Beam behavior).
- **Arrow Flight** handles parallel writes naturally.
- **Arrow Socket** can use port ranges or a thin merge layer.
- **Arrow File** lets each executor write partitioned files that can be
merged downstream.
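Step 1 above might produce a serialized element along these lines under `metadata/data-stream/` — every field name below is an assumption for illustration; the actual shape would follow Hop's standard metadata serialization:

```json
{
  "name": "PythonArrowStream",
  "implementation": "arrow-socket",
  "batchSize": 10000,
  "bufferSize": 4,
  "schemaReference": null,
  "implementationSettings": {
    "host": "127.0.0.1",
    "portRangeStart": 52000,
    "portRangeEnd": 52010
  }
}
```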
This design keeps the user experience simple and metadata-driven while
providing powerful, high-performance back-ends under the hood. It builds
directly on existing Hop patterns such as **Pipeline Probe** metadata and the
metadata injection / selection mechanisms.
### Issue Priority
Priority: 3
### Issue Component
Component: Transforms