[GitHub] [incubator-pinot] mangrrua opened a new issue #5598: Spark-Pinot connector to read and write data from/to Pinot directly

GitBox Sun, 21 Jun 2020 05:08:49 -0700


mangrrua opened a new issue #5598:
URL: https://github.com/apache/incubator-pinot/issues/5598



   Spark-Pinot connector that lets Spark read data from Pinot directly (write 
support will be added).
   
   https://github.com/mangrrua/spark-pinot-connector
   
   ### Motivation
   
   Pinot has a Spark batch ingestion job for offline processing. The job takes 
input files (partitioned or not), converts them to segment files, and pushes 
them to Pinot. Simplified steps for a batch ingestion with Spark:
   
   1. Analyze data with your favorite tool, such as Spark
   2. Write the output as CSV, JSON, Parquet, etc. to a file system
   3. Trigger the Pinot Spark batch ingestion job, which creates segment files 
from the files written in step 2 (the job writes the segment files to the Pinot 
deep storage)
   4. Wait for the Spark batch ingestion job to finish
   
   As you can see, we first have to write the data to a temporary location 
(step 2). This is an unnecessary operation if you will not use those files for 
other purposes.
   
   If you want to re-index your offline table (because requirements change 
continuously), you can reuse the processed data from step 2. But then you have 
to keep those files, even though the same data is already stored in Pinot. You 
also have to guarantee that the output files and the Pinot table data stay 
identical.
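
   The requirement that the kept output files and the Pinot table hold the same 
data can be illustrated with a minimal sketch. It compares two sets of rows 
independent of row and column order. The function names and the sample rows are 
hypothetical; in practice the rows would come from the staged files and from a 
Pinot query.

```python
import hashlib
from collections import Counter


def row_fingerprint(row):
    """Hash one row (a dict) so that column order does not matter."""
    # Values are compared by their string form, so 1 and "1" count as equal.
    canonical = "\x1f".join(f"{key}={row[key]}" for key in sorted(row))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def same_rows(staged_rows, table_rows):
    """Return True when both inputs contain the same multiset of rows."""
    staged = Counter(row_fingerprint(r) for r in staged_rows)
    table = Counter(row_fingerprint(r) for r in table_rows)
    return staged == table


# Hypothetical sample data standing in for staged files and table contents.
staged = [{"id": 1, "city": "Ankara"}, {"id": 2, "city": "Izmir"}]
in_pinot = [{"city": "Izmir", "id": 2}, {"city": "Ankara", "id": 1}]
print(same_rows(staged, in_pinot))
```

   A check like this has to be run and maintained for as long as the files are 
kept, which is part of the cost of the file-based workflow.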
   
   With the spark-pinot connector, you can read data from Pinot directly and 
write data to Pinot directly (write support will be added). This gives you a 
lot of flexibility, for example:
   
   - Pinot -> Spark -> Pinot
   - Pinot -> Spark -> Anywhere
   - Anywhere -> Spark -> Pinot
   
   You no longer need to keep the processed CSV/JSON/Parquet files (step 2). 
The connector will write data to the Pinot deep storage directly. You do not 
need to trigger an additional Spark batch ingestion job to create segment 
files.
   
   You can re-index data easily: read it from Pinot, analyze it, and write it 
to another table.
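
   The read side of such a re-index could look roughly like the sketch below. 
It only uses the standard Spark `DataFrameReader` calls (`format`, `options`, 
`load`). The data source name `"pinot"` and the option keys `table` and 
`tableType` are assumptions, as are the helper names and the table name in the 
usage note; check the connector README for the actual values.

```python
def pinot_read_options(table, table_type="offline"):
    """Build the reader options for one Pinot table.

    The option keys are assumptions, not confirmed connector settings.
    """
    return {"table": table, "tableType": table_type}


def read_pinot_table(spark, table, table_type="offline"):
    """Load a Pinot table as a DataFrame.

    `spark` is an existing SparkSession with the connector on its classpath.
    The format name "pinot" is an assumption.
    """
    options = pinot_read_options(table, table_type)
    return spark.read.format("pinot").options(**options).load()
```

   With a running session, `read_pinot_table(spark, "myTable")` would return a 
DataFrame that can be transformed like any other. Writing the result to another 
Pinot table depends on the write support that is not available yet.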
   
   Please share your comments and thoughts.
   
   I have also described future work for the connector in the _README_ file in 
the spark-pinot-connector repository.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
