mangrrua opened a new issue #5598: URL: https://github.com/apache/incubator-pinot/issues/5598
Spark-Pinot connector to read data from Pinot directly in Spark (write support will be added): https://github.com/mangrrua/spark-pinot-connector

### Motivation

Pinot has a Spark batch ingestion job for offline processing. The job expects input files (partitioned or not), converts them to segment files, and pushes them to Pinot.

Simplified steps for a batch ingestion with Spark:

1. Analyze data with your favorite tool, such as Spark.
2. Write the output as CSV, JSON, Parquet, etc. to the file system.
3. Trigger the Pinot Spark batch ingestion job, which creates segment files from the files produced in step 2 (the job writes the segment files to the Pinot deep storage).
4. Wait for the Spark batch ingestion job result.

As you can see, we first have to write the data to a temporary location (step 2). This is an unnecessary operation if you will not use the files for other purposes. If you want to re-index your offline table (because requirements change continuously), you can reuse the processed data from step 2, but then you have to keep it even though the same data is already stored in Pinot. In that situation you also have to guarantee that your output files and the Pinot table data stay identical.

With the spark-pinot connector, you can read data from Pinot directly and write data to Pinot directly (write support will be added). This gives you a lot of flexibility, for example:

- Pinot -> Spark -> Pinot
- Pinot -> Spark -> Anywhere
- Anywhere -> Spark -> Pinot

You do not need to keep the processed CSV/JSON/Parquet files (step 2). The connector will write data to the Pinot deep storage directly. You do not need to trigger an additional Spark batch ingestion job to create segment files. You can re-index data easily: just read from Pinot, analyze the data, and write it to another table.

Please share your comments and thoughts. I also described future work on the connector in the _README_ file of the spark-pinot-connector repository.
