mattcasters opened a new issue, #6963:
URL: https://github.com/apache/hop/issues/6963
### What would you like to happen?
### Proposed Overall Structure
- **New Metadata Plugin Type**: `DataStreamPluginType`
A new top-level plugin type, similar to other metadata plugin types in
Hop. It should live in the core or engine module, or as its own module under
`plugins/metadata` (or a dedicated `plugins/datastreams` category).
- **Metadata Element**: `Data Stream` (user-facing name)
- Annotated with `@HopMetadata(key = "data-stream", name = "Data Stream",
...)`
- Stored as JSON in the project’s `metadata/data-stream/` folder.
- Fully manageable and editable in the **Metadata perspective**.
- **Two New Transforms** (to be placed under `plugins/transforms`):
- **Data Stream Output** – writes rows to a selected Data Stream (producer
side)
- **Data Stream Input** – reads rows from a selected Data Stream (consumer
side)
- **Pluggable Implementations** (via the new `DataStreamPluginType`):
1. **Arrow Socket** – for same-machine or simple multi-process streaming
using Arrow IPC over TCP or Unix domain sockets.
2. **Arrow Flight** – ideal for distributed scenarios (Spark / Flink) and
network use cases; supports parallel writes from multiple executors via `DoPut`
to a central Flight server.
3. **Arrow File** – simplest decoupled option; writes Arrow IPC streams or
files (including partitioned writes to local disk, HDFS, or S3). Downstream
processes can read and optionally merge the data.
Additional implementations (ZeroMQ + Arrow, Kafka with Arrow serialization,
shared memory, etc.) can be added later without changing the core metadata or
the two transforms.
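As a rough sketch of what the pluggable contract could look like — all names here (`IDataStream`, `InMemoryDataStream`, the method signatures) are illustrative assumptions, not existing Hop APIs — the two transforms would program against a small interface, and each implementation (Arrow Socket, Arrow Flight, Arrow File, and later additions) would fulfill it. A trivial in-memory stand-in illustrates the producer/consumer flow:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical contract each Data Stream implementation would fulfill.
 * The Data Stream Output transform only calls open()/write()/close();
 * the Data Stream Input transform only calls open()/read()/close().
 */
interface IDataStream {
  void open() throws Exception;
  void write(Object[] row) throws Exception; // producer side
  Object[] read() throws Exception;          // consumer side; null = end of stream
  void close() throws Exception;
}

/**
 * Minimal in-memory stand-in, used here only to illustrate the contract;
 * a real implementation would move Arrow record batches instead of rows.
 */
class InMemoryDataStream implements IDataStream {
  private final BlockingQueue<Object[]> queue;

  InMemoryDataStream(int capacity) {
    this.queue = new ArrayBlockingQueue<>(capacity);
  }

  @Override public void open() { /* nothing to do for the in-memory case */ }

  @Override public void write(Object[] row) throws InterruptedException {
    queue.put(row); // blocks when the buffer is full (natural back-pressure)
  }

  @Override public Object[] read() throws InterruptedException {
    return queue.poll(1, TimeUnit.SECONDS); // null once no more rows arrive
  }

  @Override public void close() { queue.clear(); }
}

public class DataStreamSketch {
  public static void main(String[] args) throws Exception {
    IDataStream stream = new InMemoryDataStream(16);
    stream.open();
    stream.write(new Object[] {"id", 1});
    Object[] row = stream.read();
    System.out.println(row[0] + " -> " + row[1]); // prints "id -> 1"
    stream.close();
  }
}
```

Because the transforms only see the interface, swapping Arrow Socket for Arrow Flight (or any future ZeroMQ/Kafka variant) is purely a metadata change.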
### How It Would Work in Practice
1. **Define a Data Stream** in the Metadata perspective:
- Name: e.g. `PythonArrowStream`, `SparkCollector`,
`ExternalPythonConsumer`
- Implementation: choose Arrow Socket, Arrow Flight, or Arrow File
- Implementation-specific settings (port range, Flight endpoint, base
path, etc.)
- Common settings: batch size, direction, optional schema reference,
buffer configuration
2. **Use in Pipelines**:
- Drop a **Data Stream Output** transform and select the desired Data
Stream by name (dropdown, just like a database connection).
- Drop a **Data Stream Input** transform and select the same (or another)
Data Stream by name.
- Connections are initialized lazily when the transform first needs them.
3. **Distributed Support (Hop on Spark / Hop on Flink)**:
- Metadata is automatically serialized and distributed to all executors
(standard Hop/Beam behavior).
- **Arrow Flight** handles parallel writes naturally.
- **Arrow Socket** can use port ranges or a thin merge layer.
- **Arrow File** lets each executor write partitioned files that can be
merged downstream.
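Step 1 above might produce a serialized element along these lines under `metadata/data-stream/` — every field name below is an assumption for illustration; the actual shape would follow Hop's standard metadata serialization:

```json
{
  "name": "PythonArrowStream",
  "implementation": "arrow-socket",
  "batchSize": 10000,
  "bufferSize": 4,
  "schemaReference": null,
  "implementationSettings": {
    "host": "127.0.0.1",
    "portRangeStart": 52000,
    "portRangeEnd": 52010
  }
}
```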
This design keeps the user experience simple and metadata-driven while
providing powerful, high-performance back-ends under the hood. It builds
directly on existing Hop patterns such as **Pipeline Probe** metadata and the
metadata injection / selection mechanisms.
### Issue Priority
Priority: 3
### Issue Component
Component: Transforms