[GitHub] [druid] jose-robles2 opened a new issue, #13724: Pravega Druid extension

via GitHub Mon, 30 Jan 2023 23:12:39 -0800


jose-robles2 opened a new issue, #13724:
URL: https://github.com/apache/druid/issues/13724


   ### Motivation
   Pravega is a storage system for data streams. It stores large amounts of 
data in a row-oriented manner, so all data points relating to one object sit 
in the same data block. This benefits queries that read and manipulate an 
entire object, but it makes analysis across large amounts of data slow, so 
big data analytics queries over events are inefficient. A Druid Pravega 
extension would allow Pravega streams to be fetched and stored in Druid's 
column-oriented structure, enabling efficient queries on the data.
   
   ### Proposed changes
   Within druid/extensions-core, a `pravega-indexing-service` directory will 
be created to hold the plugin/connector. We aim to create custom classes that 
extend Druid's SeekableStream base classes (as the Kafka connector does). 
This will allow a stream to be read from a specified point in the stream. 
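
   To make the "read from a specified point" idea concrete, here is a minimal, 
self-contained sketch of a seekable reader over a toy in-memory stream. It is 
not Druid's SeekableStream code and not the Pravega client API; 
`InMemoryStream`, `SeekableReader`, and the integer offsets are hypothetical 
names and assumptions used only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class InMemoryStream:
    """Toy stand-in for a stream: each segment is an append-only list of events."""

    segments: Dict[int, List[str]] = field(default_factory=dict)

    def append(self, segment: int, event: str) -> int:
        """Append an event to a segment and return the offset it was written at."""
        events = self.segments.setdefault(segment, [])
        events.append(event)
        return len(events) - 1


class SeekableReader:
    """Hypothetical reader that tracks one read position per segment."""

    def __init__(self, stream: InMemoryStream) -> None:
        self._stream = stream
        self._positions: Dict[int, int] = {}

    def seek(self, segment: int, offset: int) -> None:
        """Move the read position of a segment to the given offset."""
        if offset < 0:
            raise ValueError("offset must be non-negative")
        self._positions[segment] = offset

    def poll(self, segment: int, max_events: int) -> List[Tuple[int, str]]:
        """Return up to max_events (offset, event) pairs and advance the position."""
        start = self._positions.get(segment, 0)
        events = self._stream.segments.get(segment, [])
        batch = list(enumerate(events[start:start + max_events], start=start))
        self._positions[segment] = start + len(batch)
        return batch

    def position(self, segment: int) -> int:
        """Current read position; this is what a task would checkpoint."""
        return self._positions.get(segment, 0)
```

   A task that restarts could call `seek` with its last checkpointed position 
and resume with `poll`, which is roughly the contract a real connector would 
need to provide, here under the simplifying assumption of integer offsets.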
   
   The Pravega API will be used to fetch the events contained in the streams, 
which will then be transformed into a Druid-compatible format such as JSON 
(with timestamp and dimensions added). An ingestion task will then be started 
through the Druid Overlord, together with an ingestion spec that tells Druid 
where to find the data, what format it is in, and how to index it. Lastly, 
the Druid ingestion API will be used to store the data in Druid.
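
   As a rough illustration of the transform and spec steps, the sketch below 
turns a raw event payload into a JSON row with a timestamp and builds a 
supervisor-style spec modelled on the shape of the existing Kafka one. The 
`"pravega"` type and the `scope` and `stream` ioConfig fields are assumptions, 
since no such extension exists yet; the `dataSchema` layout follows Druid's 
documented ingestion spec. Nothing here contacts a Druid or Pravega cluster.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict, List


def to_druid_row(payload: bytes, ingested_at: datetime) -> str:
    """Decode a JSON event payload and add an ISO-8601 timestamp if it has none.

    ingested_at should be timezone-aware; it is converted to UTC.
    """
    row: Dict[str, Any] = json.loads(payload.decode("utf-8"))
    row.setdefault("timestamp", ingested_at.astimezone(timezone.utc).isoformat())
    return json.dumps(row, sort_keys=True)


def build_supervisor_spec(
    data_source: str, scope: str, stream: str, dimensions: List[str]
) -> Dict[str, Any]:
    """Build a spec dict shaped like a Kafka supervisor spec (Pravega parts assumed)."""
    return {
        "type": "pravega",  # hypothetical type name for the proposed extension
        "spec": {
            "dataSchema": {
                "dataSource": data_source,
                "timestampSpec": {"column": "timestamp", "format": "iso"},
                "dimensionsSpec": {"dimensions": dimensions},
                "granularitySpec": {
                    "segmentGranularity": "hour",
                    "queryGranularity": "none",
                    "rollup": False,
                },
            },
            "ioConfig": {
                "scope": scope,    # hypothetical: Pravega scope to read from
                "stream": stream,  # hypothetical: Pravega stream to read from
                "inputFormat": {"type": "json"},
            },
        },
    }


if __name__ == "__main__":
    now = datetime(2023, 1, 30, 23, 12, 39, tzinfo=timezone.utc)
    print(to_druid_row(b'{"sensor": "s1", "value": 3}', now))
    print(json.dumps(build_supervisor_spec("events", "demo", "readings", ["sensor"]), indent=2))
```

   For the Kafka connector, a spec of this shape is submitted to the Overlord's 
supervisor endpoint; whether the Pravega connector would reuse that path 
unchanged is a design decision left to the implementation.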
   
   The existing Kafka Druid connector will be used as a reference for creating 
this Pravega Druid connector.
   
   ### Rationale
   Druid works best with event-oriented data, which makes it a good fit for 
Pravega: Pravega's streams contain segments, which consist of events gathered 
from a variety of data sources. Druid's high efficiency and ability to 
compress columns make it the best choice to integrate with Pravega. 
   
   ### Operational impact
   The operational impact of adding a Druid Pravega connector would depend on 
factors such as the size of the data being processed, current Druid 
infrastructure, as well as the design of the Druid Pravega connector itself. 
The connector could increase load on the system if it handles a lot of data, 
potentially decreasing performance.
   
   Nothing in the current Druid design will be removed; only additions will 
be made, mirroring the Kafka Druid connector. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

