[I] [PEP] stream plugin that supports object-storage micro-batches with optional inline batch support [pinot]

via GitHub Mon, 08 Dec 2025 02:03:44 -0800


udaysagar2177 opened a new issue, #17331:
URL: https://github.com/apache/pinot/issues/17331


   ## Summary:
   Introduce a new stream plugin that can handle external files or inline 
batches for real-time ingestion.
   
   ## Motivation:
   - Reduce the network cost of Kafka cross-AZ replication, produce, and 
consume traffic.
   - Kafka delivers sub-second message latency but requires substantial 
infrastructure; many use cases can tolerate higher latency at lower cost.
   - High-volume traffic can exceed Kafka broker throughput or retention 
limits, necessitating complex operational management.
   
   ## Core Mechanics
   - Partition-level consumers receive micro-batch descriptor records instead 
of individual events.
   - Each record triggers a controlled background fetch that downloads the 
referenced object via **PinotFS**.
   - A dedicated thread extracts events using the configured **RecordReader** 
and pushes them into a bounded in-memory queue.
   - The **LLC consumer** remains unchanged and reads from this in-memory queue 
as if it were a normal streaming source.
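
The flow above can be sketched roughly as follows. This is a minimal illustration in Python rather than Pinot's Java, and every name in it (`fetch_object`, `read_records`, `producer`, `consume`, the `events` field) is hypothetical: `fetch_object` stands in for a PinotFS download and `read_records` for the configured RecordReader. The point is only to show how a bounded queue gives the fetch thread backpressure while the consumer reads events in order.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the descriptor stream


def fetch_object(descriptor):
    """Hypothetical stand-in for a PinotFS download of the referenced object."""
    return descriptor["events"]  # inline payload used in place of a real download


def read_records(blob):
    """Hypothetical stand-in for a RecordReader: yields one event at a time."""
    yield from blob


def producer(descriptors, out_queue):
    """Background thread: fetch each batch and push its events to the queue."""
    for descriptor in descriptors:
        blob = fetch_object(descriptor)
        for event in read_records(blob):
            out_queue.put(event)  # blocks while the queue is full (backpressure)
    out_queue.put(_DONE)


def consume(descriptors, capacity=2):
    """Consumer side: drain the bounded queue like a normal event stream."""
    out_queue = queue.Queue(maxsize=capacity)
    worker = threading.Thread(
        target=producer, args=(descriptors, out_queue), daemon=True
    )
    worker.start()
    events = []
    while True:
        item = out_queue.get()
        if item is _DONE:
            break
        events.append(item)
    worker.join()
    return events
```

With a queue capacity smaller than a batch, the producer simply waits for the consumer, so memory stays bounded regardless of object size.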
   
   ## Offset Model
   - Offsets use a serialized JSON structure similar to the **Kinesis** stream 
plugin.
   - JSON tracks:
     - The Kafka record offset carrying the micro-batch descriptor record.
     - The intra-file (or intra-batch) event offset.
   - Supports replay correctness, restart recovery, and stable start/end offset 
behavior.
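
As a rough illustration of such a composite offset, the sketch below serializes the two tracked values to JSON and orders offsets by descriptor offset first, then by event position. The class name and field names (`MicroBatchOffset`, `descriptor_offset`, `event_index`) are assumptions made for this example; the proposal does not specify the actual JSON schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True, order=True)
class MicroBatchOffset:
    """Hypothetical composite offset; field names are illustrative only."""

    descriptor_offset: int  # Kafka offset of the micro-batch descriptor record
    event_index: int        # position of the next event inside that batch

    def to_json(self):
        # sort_keys keeps the serialized form stable across runs
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text):
        data = json.loads(text)
        return cls(int(data["descriptor_offset"]), int(data["event_index"]))
```

Because comparison follows field order, an offset inside batch 42 always sorts before any offset in batch 43, which is the property a restart needs in order to resume in the middle of a batch.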
   
   ## Other advantages:
   - Enables real-time ingestion of Avro or Parquet files without complicating 
the architecture with Spark or a separate batch ingestion job.
   - Inline batches compress more efficiently than the message-per-event 
model.
   
   ## Micro-batch descriptor protocol
   - The micro-batch protocol defines **deterministic sub-selection rules**.
   - Consumers may extract:
     - The full batch, or
     - Only the assigned-partition subset.
   - Replay semantics remain fully stable.
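
The proposal does not spell out the sub-selection rule, so the following is only one possible sketch of what "deterministic" could mean here: each event is assigned to a partition by a stable hash of a key, so the same batch always yields the same subset on replay. The function name, the `id` key field, and the choice of CRC32 are all assumptions for illustration.

```python
import zlib


def select_events(events, partition=None, num_partitions=1, key_field="id"):
    """Return the full batch, or only the events owned by `partition`.

    Uses zlib.crc32 because it is stable across processes and restarts,
    unlike Python's built-in hash() for strings.
    """
    if partition is None:
        return list(events)  # full-batch mode
    selected = []
    for event in events:
        key = str(event[key_field]).encode("utf-8")
        if zlib.crc32(key) % num_partitions == partition:
            selected.append(event)
    return selected
```

Under this rule every event belongs to exactly one partition, and a consumer that re-reads a batch after a restart sees the same events in the same order.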
   
   ## Expected Outcome
   A stream plugin that supports object-storage micro-batches with optional 
inline batch support, reduces Kafka network cost and operational load, and 
simplifies ingestion pipelines for file-based formats.
   
   If this proposal aligns with the project’s direction, I would be happy to 
move it forward and submit a pull request for review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

