JulianJaffePinterest commented on issue #9780: URL: https://github.com/apache/druid/issues/9780#issuecomment-878893158
If a Spark application produces segments with the exact same version as a real-time task, the precise behavior depends on the shard spec, but it will likely result in neither set of segments being loaded by the Druid cluster. A simple way to avoid this is to use a version prefix for the segments written via Spark (to ensure they always overshadow real-time segments). If you want Spark-produced and real-time segments to live alongside each other, you could follow the major/minor version approach that Druid's internal compaction jobs use. And if you want real-time and Spark-produced segments to "compete" and have the latest task win, you can use timestamp versions and slice off a digit of precision from the timestamp in Spark, so that the two different methods can't produce identical versions.

Down the road, if there's community demand or a specific use case where it makes sense to integrate this writer with Druid task locks, it shouldn't be too difficult to do so, but I haven't come across one so far. Maybe as Druid's internal tasks continue to grow in power and complexity, some will arise.
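To make the third option concrete, here's a minimal sketch (the helper name and the exact precision choice are illustrative, not part of any Druid or Spark API). Druid's overlord-assigned versions are ISO-8601 timestamps with millisecond precision, so a Spark job that emits versions with one fewer fractional digit can never produce a string identical to a real-time task's version, while the strings still order chronologically against each other:

```python
from datetime import datetime, timezone

def spark_segment_version(now=None):
    """Hypothetical helper: build a version string that cannot collide with a
    Druid task's millisecond-precision ISO-8601 version, by slicing off the
    last digit of fractional-second precision."""
    now = now or datetime.now(timezone.utc)
    # A Druid-style version looks like 2021-07-13T10:15:30.123Z (milliseconds).
    millis = now.strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3] + "Z"
    # Drop one digit of precision: ...30.12Z instead of ...30.123Z,
    # so no millisecond-precision version string can ever equal it.
    return millis[:-2] + "Z"

fixed = datetime(2021, 7, 13, 10, 15, 30, 123456, tzinfo=timezone.utc)
print(spark_segment_version(fixed))  # 2021-07-13T10:15:30.12Z
```

Since version comparison is a string comparison, the truncated Spark version sorts just before the same instant's millisecond version, which is usually the behavior you want when the two pipelines race.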
