[GitHub] [druid] JulianJaffePinterest commented on issue #9780: Support directly reading and writing Druid data from Spark

GitBox Thu, 08 Jul 2021 03:17:24 -0700


JulianJaffePinterest commented on issue #9780:
URL: https://github.com/apache/druid/issues/9780#issuecomment-876317064



   @520lailai The calling Spark application sets the version of the segments 
it writes. If a user runs multiple jobs that generate segments for the same 
data source and time chunk, the segments from the job assigned the highest 
version are the ones that end up available once all jobs have finished. If 
you have a specific use case in mind where you will be running concurrent 
Spark jobs that target the same data sources and intervals, I'd be happy to 
give you more tailored suggestions.
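   
   To make the "highest version wins" rule concrete, here is a minimal Python 
sketch. It is not Druid's or the writer's actual code: the `Segment` tuple, 
the `written_by` label, and `visible_segments` are hypothetical names, and it 
assumes that versions compare as plain strings (which holds for ISO-8601 
timestamp versions) and that competing jobs target identical time chunks 
rather than partially overlapping intervals.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    data_source: str
    interval: str    # time chunk, e.g. "2021-07-01/2021-07-02"
    version: str     # set by the Spark application that wrote the segment
    written_by: str  # hypothetical label, only used for this illustration


def visible_segments(segments: List[Segment]) -> List[Segment]:
    """Keep, per (data source, time chunk), only the highest-version segments."""
    best = {}
    for seg in segments:
        key = (seg.data_source, seg.interval)
        # Assumption: versions are ordered by plain string comparison.
        if key not in best or seg.version > best[key]:
            best[key] = seg.version
    return [s for s in segments if s.version == best[(s.data_source, s.interval)]]
```

   In this sketch, if two jobs write the same data source and time chunk, only 
the segments from the job with the greater version string are returned, 
regardless of which job finished first.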
   
   As for Druid task locks, a Spark application calling this writer is not a 
Druid task. The application is not triggered by a Druid cluster, and the Druid 
cluster is unaware of the application. If we instead view Druid task locks as 
segment locks, I could imagine using the internal API to acquire a lock on 
write, but it would only be useful in limited circumstances. I don't see value 
in delaying a Spark job's write (the job will write segments with the 
specified version regardless, so shifting the write in time changes nothing). 
I can see where delaying a real-time ingestion task may be useful. If there's 
community demand for integrating with Druid locks, it could be done. 
   
   Finally, to your point about security concerns, the writer must provide its 
own metadata server credentials. The metadata client will only ever attempt to 
read from and insert into an existing table, so ideally the associated user 
should have only those permissions. If you're planning to provide credentials 
independently of users (for example, via environment variables or a credential 
store running on the Spark cluster nodes themselves), you should not allow 
anyone to submit Spark applications unless you would also allow them to send 
POST requests to the Overlord.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


