udaysagar2177 opened a new issue, #17331:
URL: https://github.com/apache/pinot/issues/17331
## Summary:
Introduce a new stream plugin that can handle external files or inline
batches for real-time ingestion.
## Motivation:
- Reduce Kafka cross-AZ replication, production, and consumption network
costs.
- Kafka delivers sub-second message latency but requires substantial
infrastructure; many use cases can tolerate higher latency at lower cost.
- High-volume traffic can exceed Kafka broker throughput or retention
limits, necessitating complex operational management.
## Core Mechanics
- Partition-level consumers receive micro-batch descriptor records instead
of individual events.
- Each descriptor record triggers a controlled background fetch that
downloads the referenced object via **PinotFS**.
- A dedicated thread extracts events using the configured **RecordReader**
and pushes them into a bounded in-memory queue.
- The **LLC consumer** remains unchanged and reads from this in-memory queue
as if it were a normal streaming source.
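The flow above can be sketched in plain Java as a rough illustration. Everything here is hypothetical and not taken from Pinot: `MicroBatchBuffer`, `startFetcher`, and `poll` are made-up names, and the `fetchAndDecode` function merely stands in for "download via PinotFS, then decode with the RecordReader". The point is only to show how a bounded queue gives back-pressure between the fetch thread and the consumer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

/** Hypothetical sketch: a fetcher thread feeds a bounded queue that the consumer drains. */
final class MicroBatchBuffer {
    private final BlockingQueue<String> queue;
    // Stand-in for "download the object and decode its events"; not a Pinot API.
    private final Function<String, List<String>> fetchAndDecode;

    MicroBatchBuffer(int capacity, Function<String, List<String>> fetchAndDecode) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        this.fetchAndDecode = fetchAndDecode;
    }

    /** Starts a background thread that expands each descriptor into events, in order. */
    Thread startFetcher(List<String> descriptors) {
        Thread t = new Thread(() -> {
            try {
                for (String descriptor : descriptors) {
                    for (String event : fetchAndDecode.apply(descriptor)) {
                        queue.put(event); // blocks while the queue is full (back-pressure)
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // stop quietly when interrupted
            }
        }, "micro-batch-fetcher");
        t.setDaemon(true);
        t.start();
        return t;
    }

    /** Consumer side: returns up to maxEvents, waiting at most timeoutMs for each one. */
    List<String> poll(int maxEvents, long timeoutMs) {
        List<String> out = new ArrayList<>();
        try {
            while (out.size() < maxEvents) {
                String event = queue.poll(timeoutMs, TimeUnit.MILLISECONDS);
                if (event == null) {
                    break; // nothing arrived within the timeout
                }
                out.add(event);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return out;
    }
}
```

Because the queue is bounded, a slow consumer stalls the fetcher instead of letting downloaded events pile up in memory, which is the property the unchanged consumer relies on in this design.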
## Offset Model
- Offsets use a serialized JSON structure similar to the **Kinesis** stream
plugin.
- The JSON tracks:
  - The offset of the Kafka record carrying the micro-batch descriptor.
  - The intra-file (or intra-batch) event offset.
- Supports replay correctness, restart recovery, and stable start/end offset
behavior.
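As a minimal sketch of such a two-level offset, assuming field names `kafkaOffset` and `eventIndex` (both hypothetical; the proposal does not fix a schema), the class below serializes to JSON and orders offsets first by descriptor position and then by position inside the batch. The hand-rolled JSON handling is only to keep the example dependency-free; a real plugin would likely use a JSON library.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical composite offset: descriptor position plus position inside the batch. */
final class MicroBatchOffset implements Comparable<MicroBatchOffset> {
    // Matches exactly the format produced by toJson(); field names are assumptions.
    private static final Pattern JSON =
        Pattern.compile("\\{\"kafkaOffset\":(\\d+),\"eventIndex\":(\\d+)\\}");

    final long kafkaOffset; // offset of the Kafka record carrying the descriptor
    final long eventIndex;  // index of the next event inside that file or batch

    MicroBatchOffset(long kafkaOffset, long eventIndex) {
        this.kafkaOffset = kafkaOffset;
        this.eventIndex = eventIndex;
    }

    /** Serializes the offset as a compact JSON object. */
    String toJson() {
        return "{\"kafkaOffset\":" + kafkaOffset + ",\"eventIndex\":" + eventIndex + "}";
    }

    /** Parses the format written by toJson(); rejects anything else. */
    static MicroBatchOffset fromJson(String json) {
        Matcher m = JSON.matcher(json.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognized offset: " + json);
        }
        return new MicroBatchOffset(Long.parseLong(m.group(1)), Long.parseLong(m.group(2)));
    }

    /** Orders by descriptor offset first, then by event index within the batch. */
    @Override
    public int compareTo(MicroBatchOffset other) {
        int byRecord = Long.compare(kafkaOffset, other.kafkaOffset);
        return byRecord != 0 ? byRecord : Long.compare(eventIndex, other.eventIndex);
    }
}
```

With this ordering, a restart from a checkpointed offset can re-read the same descriptor and skip the first `eventIndex` events, which is one way the replay and recovery behavior described above could be achieved.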
## Other advantages:
- Enables real-time ingestion of Avro or Parquet files without complicating
the architecture with Spark or a separate batch ingestion job.
- Enables improved compression efficiency using inline batches compared to
the message-per-event model.
## Micro-batch descriptor protocol
- The micro-batch protocol defines **deterministic sub-selection rules**.
- Consumers may extract:
  - The full batch, or
  - Only the assigned-partition subset.
- Replay semantics remain fully stable.
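The proposal does not spell out the sub-selection rule, so the following is only one possible illustration, assuming each event carries a string key and that partitions are assigned by hashing that key. `SubSelection`, `partitionOf`, and `select` are hypothetical names. `String.hashCode()` is specified by the Java language, so the mapping is identical on every JVM and every replay.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical deterministic sub-selection of a micro-batch by event key. */
final class SubSelection {
    /** Maps a key to a partition; floorMod keeps the result in [0, numPartitions). */
    static int partitionOf(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    /**
     * Returns the full batch when fullBatch is true, otherwise only the keys
     * that map to assignedPartition. Input order is preserved, so replaying
     * the same batch always yields the same sequence.
     */
    static List<String> select(List<String> keys, int assignedPartition,
                               int numPartitions, boolean fullBatch) {
        if (fullBatch) {
            return new ArrayList<>(keys);
        }
        List<String> subset = new ArrayList<>();
        for (String key : keys) {
            if (partitionOf(key, numPartitions) == assignedPartition) {
                subset.add(key);
            }
        }
        return subset;
    }
}
```

Since the rule depends only on the event key and the partition count, every consumer that reads the same batch picks a disjoint subset, and the subsets together cover the whole batch.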
## Expected Outcome
A stream plugin that supports object-storage micro-batches with optional
inline batch support, reduces Kafka network and infrastructure costs, and
simplifies ingestion pipelines for file-based formats.
If this proposal aligns with the project’s direction, I would be happy to
move it forward and submit a pull request for review.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]