paul-rogers commented on PR #12159: URL: https://github.com/apache/druid/pull/12159#issuecomment-1146943391
@JulianJaffePinterest, this is a great contribution! This code writes to Druid's metadata DB directly. I've found the need to support "external tasks": ingestion tasks that run outside of Druid but interact with Druid's Overlord the same way an "internal" (Druid `Task`) would. I'll open a PR in a week or two. In creating that enhancement, I was reminded of the work you did here. Might such an approach work here as an alternative way to publish the segments that Spark creates?

Druid's own ingestion tasks do most of what you described for this set of PRs:

* Read some input.
* Build some segments and upload them to deep storage.
* Publish the segments (by telling the Overlord about them, which triggers the Coordination actions).
* Release the locks obtained during the above process.

The main difference between Spark and Druid's native ingestion is that, with Spark, the ingestion "task" runs outside of Druid. If the Spark task ran inside of Druid (that is, were launched by the Overlord), then the task could use the existing task APIs to do the steps above. But since Spark (and my project) live outside of Druid, that doesn't work.

So the "external task" API is a variation on the existing internal API, but with a way for the Overlord to monitor the external task, since it can't do so directly. Basically, you need a way for the Overlord to check that your task is still alive, so that it keeps maintaining your locks. This can be done via ZK (for long-running services) or via a heartbeat (for everything else). With that, the flow is:

* Register an external task. Either tell the Overlord of a ZK host to monitor, or promise to keep up a heartbeat.
* Submit `TaskAction`s (for locking, publishing, etc.).
* Either "commit" the task (saying your task succeeded) or "fail" the task (saying that your task failed). Both release any locks and tell the Overlord to stop tracking your task.
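To make the proposed lifecycle concrete, here is a minimal sketch against an in-memory stand-in for the Overlord. Everything here is hypothetical: `registerExternalTask`, `heartbeat`, `submitAction`, and `commit` are my assumptions about what the external-task API might look like, not existing Druid APIs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-memory stand-in for the Overlord's proposed external-task bookkeeping.
// All method names are illustrative assumptions, not existing Druid APIs.
class StubOverlord {
    private final Map<String, Long> lastHeartbeat = new HashMap<>();
    private final Map<String, List<String>> taskLocks = new HashMap<>();

    // Step 1: register the external task, promising to keep up a heartbeat.
    void registerExternalTask(String taskId) {
        lastHeartbeat.put(taskId, System.currentTimeMillis());
        taskLocks.put(taskId, new ArrayList<>());
    }

    // The external process pings periodically; a real Overlord would release
    // the task's locks if the heartbeat stopped arriving.
    void heartbeat(String taskId) {
        lastHeartbeat.put(taskId, System.currentTimeMillis());
    }

    // Step 2: submit task actions (lock acquisition, segment publishing, ...).
    void submitAction(String taskId, String action) {
        if (action.startsWith("lock:")) {
            taskLocks.get(taskId).add(action.substring("lock:".length()));
        }
        // Other actions (e.g. "publish:...") would update the metadata store.
    }

    // Step 3: commit (or fail) releases all locks and stops monitoring.
    // Returns the released locks, purely for illustration.
    List<String> commit(String taskId) {
        lastHeartbeat.remove(taskId);
        return taskLocks.remove(taskId);
    }
}

public class ExternalTaskSketch {
    public static void main(String[] args) {
        StubOverlord overlord = new StubOverlord();
        String taskId = "spark_ingest_1";

        overlord.registerExternalTask(taskId);
        overlord.submitAction(taskId, "lock:2022-01-01/2022-01-02");
        overlord.heartbeat(taskId); // keep locks alive while Spark builds segments
        overlord.submitAction(taskId, "publish:segments");

        List<String> released = overlord.commit(taskId);
        System.out.println("released locks: " + released);
    }
}
```

The point of the sketch is only the ordering: register before any `TaskAction`, heartbeat for as long as the external work runs, and a terminal commit/fail that releases locks in one step.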
With this, Spark ingestion becomes easier: to the Overlord it looks just like Druid's own internal ingestion, with the one difference that the Overlord did not launch, and cannot monitor, the actual task. (Hence the heartbeat.) This avoids the need for your code to handle all of Druid's metadata DB storage choices, and to track changes to Druid's metadata DB logic.

If the "external task" feature would be of interest here, I can work with you to integrate that API once the external task PR goes through the review process.