JulianJaffePinterest commented on issue #9780: URL: https://github.com/apache/druid/issues/9780#issuecomment-878893158
If a Spark application produces segments with the exact same version as a real-time task, the precise behavior depends on the shard spec, but it will likely result in neither set of segments being loaded by the Druid cluster. A simple way to avoid this is to use a version prefix for the segments written via Spark (to ensure they always overshadow real-time segments). If you want Spark-produced and real-time segments to live alongside each other, you could follow the major/minor version approach that Druid's internal compaction jobs use. And if you want real-time and Spark-produced segments to "compete" and have the latest task win, you can use timestamp versions and slice off a digit of precision from the timestamp in Spark, so that the two different methods can't produce identical versions.

Down the road, if there's community demand or a specific use case where it makes sense to integrate this writer with Druid task locks, it shouldn't be too difficult to do so, but I haven't come across one so far. Maybe as Druid's internal tasks continue to grow in power and complexity, some will arise.
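To make the third option concrete, here's a minimal sketch (the helper name and the exact precision choice are illustrative, not part of any Druid or Spark API). Druid's overlord-assigned versions are ISO-8601 timestamps with millisecond precision, so a Spark job that emits versions with one fewer fractional digit can never produce a string identical to a real-time task's version, while the strings still order chronologically against each other:

```python
from datetime import datetime, timezone

def spark_segment_version(now=None):
    """Hypothetical helper: build a version string that cannot collide with a
    Druid task's millisecond-precision ISO-8601 version, by slicing off the
    last digit of fractional-second precision."""
    now = now or datetime.now(timezone.utc)
    # A Druid-style version looks like 2021-07-13T10:15:30.123Z (milliseconds).
    millis = now.strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3] + "Z"
    # Drop one digit of precision: ...30.12Z instead of ...30.123Z,
    # so no millisecond-precision version string can ever equal it.
    return millis[:-2] + "Z"

fixed = datetime(2021, 7, 13, 10, 15, 30, 123456, tzinfo=timezone.utc)
print(spark_segment_version(fixed))  # 2021-07-13T10:15:30.12Z
```

Since version comparison is a string comparison, the truncated Spark version sorts just before the same instant's millisecond version, which is usually the behavior you want when the two pipelines race.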
