fx19880617 opened a new issue #6610:
URL: https://github.com/apache/incubator-pinot/issues/6610


   ### Motivation
   Modern data processing and analytics usually involve both streaming and 
batch pipelines. For the streaming pipeline, Pinot uses Kafka as a data source. 
For the batch pipeline, Pinot only supports file-based segment generation in 
Hadoop and Spark. Splitting the real-time and batch pipelines makes data 
management harder and more error-prone. 
   
   Recent industry work on consolidating computation frameworks (e.g. 
Flink and Spark) lets users reuse the same computation logic and change only 
the data sources and data sinks. For Pinot specifically:
   On the streaming side, we can still use a Kafka sink and let Pinot consume 
from it, which keeps ingestion latency as low as possible.
   On the batch side, it would be good to have a Pinot SegmentWriter that 
wraps all the segment generation and push logic.
   
   Simply reusing the APIs that the Spark connector calls is not a good 
option. Those APIs expose a lot of Pinot internals (e.g. you need to create and 
initialize a SegmentCreationDriverImpl with a lot of values). An abstraction 
over segment creation is needed, and the Pinot Segment Write API will be that 
abstraction. It will let us write connectors for more such sources easily, 
without each source having to know how to use the Pinot segment generation 
code.
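   
   To make the shape of such an abstraction concrete, here is a minimal 
sketch. It is not the proposed API (the design doc defines that); the method 
names `collect`, `flush` and `close`, the `Map`-based row type, and the 
in-memory `InMemorySegmentWriter` are all assumptions made for illustration. 
The point is only that a connector would hand rows to a writer and never touch 
segment generation internals.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical interface: the method names and the Map-based row type are
// illustrative assumptions, not the Pinot Segment Write API itself.
interface SegmentWriter extends AutoCloseable {
    // Buffer one row for the segment currently being built.
    void collect(Map<String, Object> row);

    // Turn the buffered rows into a segment and return how many rows it holds.
    int flush();

    @Override
    void close();
}

// In-memory stand-in so the sketch runs without Pinot on the classpath.
// A real implementation would build and push a Pinot segment in flush().
final class InMemorySegmentWriter implements SegmentWriter {
    private final List<Map<String, Object>> buffer = new ArrayList<>();
    private final List<List<Map<String, Object>>> segments = new ArrayList<>();
    private boolean closed = false;

    @Override
    public void collect(Map<String, Object> row) {
        if (closed) {
            throw new IllegalStateException("writer is closed");
        }
        buffer.add(row);
    }

    @Override
    public int flush() {
        if (closed) {
            throw new IllegalStateException("writer is closed");
        }
        int rows = buffer.size();
        if (rows > 0) {
            // Snapshot the buffer as one "segment", then start a new one.
            segments.add(new ArrayList<>(buffer));
            buffer.clear();
        }
        return rows;
    }

    @Override
    public void close() {
        // Rows that were never flushed are discarded.
        closed = true;
        buffer.clear();
    }

    // Number of segments produced so far.
    int segmentCount() {
        return segments.size();
    }
}
```

   A batch sink in Flink or Spark would then call `collect` per record and 
`flush` at a checkpoint or at the end of a partition, which is the only 
contract the connector needs to know in this sketch.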
   
   ### Detailed design doc to follow:
   
https://docs.google.com/document/d/1f_JlegCkH_Zysm80maLnv7iqgWtD9uPiBLkeLmMUoNg/edit?usp=sharing
   


