paul-rogers commented on PR #12159: URL: https://github.com/apache/druid/pull/12159#issuecomment-1146943391
@JulianJaffePinterest, this is a great contribution! This code writes to Druid's metadata DB directly. I've found the need to support "external tasks": ingestion tasks that run outside of Druid but interact with Druid's Overlord the same way an "internal" (Druid `Task`) would. I'll open a PR in a week or two. In creating that enhancement, I was reminded of the work you did here. Might such an approach work here as an alternative way to publish the segments that Spark creates?

Druid's own ingestion tasks do most of what you described for this set of PRs:

* Read some input.
* Build some segments and upload them to deep storage.
* Publish the segments (by telling the Overlord about them, which triggers the Coordination actions).
* Release the locks obtained during the above process.

The main difference between Spark and Druid's native ingestion is that, with Spark, the ingestion "task" runs outside of Druid. If the Spark task ran inside of Druid (that is, were launched by the Overlord), then the task could use the existing task APIs to do the steps above. But since Spark (and my project) live outside of Druid, that doesn't work.

So the "external task" API is a variation on the existing internal API, but with a way for the Overlord to monitor the external task, since it can't do so directly. Basically, you need a way for the Overlord to check that your task is still alive, so that it keeps maintaining your locks. This can be done via ZK (for long-running services) or via a heartbeat (for everything else). With that, the flow is:

* Register an external task. Either tell the Overlord of a ZK host to monitor, or promise to keep up a heartbeat.
* Submit `TaskAction`s (for locking, publishing, etc.).
* Either "commit" the task (saying your task succeeded) or "fail" the task (saying that your task failed). Both release any locks and tell the Overlord to stop tracking your task.
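To make the proposed lifecycle concrete, here is a minimal sketch against an in-memory stand-in for the Overlord. Everything here is hypothetical: `registerExternalTask`, `heartbeat`, `submitAction`, and `commit` are my assumptions about what the external-task API might look like, not existing Druid APIs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-memory stand-in for the Overlord's proposed external-task bookkeeping.
// All method names are illustrative assumptions, not existing Druid APIs.
class StubOverlord {
    private final Map<String, Long> lastHeartbeat = new HashMap<>();
    private final Map<String, List<String>> taskLocks = new HashMap<>();

    // Step 1: register the external task, promising to keep up a heartbeat.
    void registerExternalTask(String taskId) {
        lastHeartbeat.put(taskId, System.currentTimeMillis());
        taskLocks.put(taskId, new ArrayList<>());
    }

    // The external process pings periodically; a real Overlord would release
    // the task's locks if the heartbeat stopped arriving.
    void heartbeat(String taskId) {
        lastHeartbeat.put(taskId, System.currentTimeMillis());
    }

    // Step 2: submit task actions (lock acquisition, segment publishing, ...).
    void submitAction(String taskId, String action) {
        if (action.startsWith("lock:")) {
            taskLocks.get(taskId).add(action.substring("lock:".length()));
        }
        // Other actions (e.g. "publish:...") would update the metadata store.
    }

    // Step 3: commit (or fail) releases all locks and stops monitoring.
    // Returns the released locks, purely for illustration.
    List<String> commit(String taskId) {
        lastHeartbeat.remove(taskId);
        return taskLocks.remove(taskId);
    }
}

public class ExternalTaskSketch {
    public static void main(String[] args) {
        StubOverlord overlord = new StubOverlord();
        String taskId = "spark_ingest_1";

        overlord.registerExternalTask(taskId);
        overlord.submitAction(taskId, "lock:2022-01-01/2022-01-02");
        overlord.heartbeat(taskId); // keep locks alive while Spark builds segments
        overlord.submitAction(taskId, "publish:segments");

        List<String> released = overlord.commit(taskId);
        System.out.println("released locks: " + released);
    }
}
```

The point of the sketch is only the ordering: register before any `TaskAction`, heartbeat for as long as the external work runs, and a terminal commit/fail that releases locks in one step.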
With this, Spark ingestion becomes easier: to the Overlord it looks just like Druid's own internal ingestion, with the one difference that the Overlord did not launch, and cannot monitor, the actual task. (Hence the heartbeat.) This avoids the need for your code to handle all of Druid's metadata DB storage choices, and to track changes to Druid's metadata DB logic.

If the "external task" feature would be of interest here, I can work with you to integrate that API once the external task PR goes through the review process.