churromorales opened a new issue, #12892:
URL: https://github.com/apache/druid/issues/12892
### Motivation
Right now we need to know the number of task slots in advance in order to
configure middle managers appropriately. We could let the k8s scheduler handle
this instead, which would make configuration easier for customers. We could
still have an upper limit on concurrent tasks, but the system would become more
elastic: when slots are not being used, the Kubernetes resources originally
allocated for the middle manager would be freed up.
### Proposed changes
We need to reduce the peon tasks' dependency on the middle manager's file
system. It is difficult to share files dynamically between a Kubernetes
container and the application launching it. ConfigMaps do exist for this
purpose, but there is currently no lifecycle management that deletes a
ConfigMap after its job terminates. Additionally, many k8s environments have a
quota on the number of ConfigMaps one can use, and there are limits on how big
a ConfigMap can be.
*We need to resolve the following:*
1. The TaskLogPusher currently pushes the reports.json file from the middle
manager. This needs to move into the task itself, because a task running in a
k8s job can no longer call back to the middle manager.
2. Saving tasks (only happens in k8s mode):
    1. Whenever we create intermediate persist segment files, we should also
    push them to deep storage, in a directory specific to the task itself.
    2. When the middle manager writes a restore.json file, that file should
    also be pushed to deep storage.
3. Restoring tasks (only happens in k8s mode):
    1. When a task starts up, the first thing it does is pull down the
    intermediate persist files and the restore.json file to a local volume in
    its own task dir. That way, once it starts, it behaves the same way it
    would if it were running in non-k8s mode.
4. If the task itself can push segments, I don’t think it is unreasonable
for it to push other relevant data.
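
The save and restore steps in items 2 and 3 could look roughly like the sketch below. It is only an illustration and not Druid code: the class `TaskStateSync`, its method names, and the `<root>/<taskId>/` layout are hypothetical, and a local directory stands in for deep storage so the example stays self-contained.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/**
 * Hypothetical helper (not part of Druid): mirrors a task's local state to and
 * from a per-task directory in deep storage. A plain directory stands in for
 * the real deep storage backend.
 */
public class TaskStateSync {
    private final Path deepStorageRoot; // assumption: stand-in for S3, HDFS, etc.

    public TaskStateSync(Path deepStorageRoot) {
        this.deepStorageRoot = deepStorageRoot;
    }

    /** Saving: copy intermediate persists and restore.json to <root>/<taskId>/. */
    public void push(String taskId, Path localTaskDir) {
        copyTree(localTaskDir, deepStorageRoot.resolve(taskId));
    }

    /**
     * Restoring: pull previously saved files into the local task dir before the
     * task starts. Returns false when nothing was saved for this task id.
     */
    public boolean restore(String taskId, Path localTaskDir) {
        Path remote = deepStorageRoot.resolve(taskId);
        if (!Files.isDirectory(remote)) {
            return false;
        }
        copyTree(remote, localTaskDir);
        return true;
    }

    /** Recursively copies a directory tree, overwriting files that already exist. */
    private static void copyTree(Path from, Path to) {
        try (Stream<Path> paths = Files.walk(from)) {
            // Files.walk visits parents before children, so directories exist
            // by the time their files are copied.
            for (Path src : paths.collect(Collectors.toList())) {
                Path dst = to.resolve(from.relativize(src).toString());
                if (Files.isDirectory(src)) {
                    Files.createDirectories(dst);
                } else {
                    Files.createDirectories(dst.getParent());
                    Files.copy(src, dst, StandardCopyOption.REPLACE_EXISTING);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With something like this, a task would call `push` after each intermediate persist or restore.json write, and `restore` once at startup, so the rest of the task code sees the same local task dir it would see in non-k8s mode.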
### Rationale
We would still leverage the middle manager for the first pass. This
hopefully paves the way for one day removing the MiddleManager and having the
overlord itself launch these tasks as k8s jobs. Because the filesystem is so
tightly coupled between the middle manager and the peon tasks, we want to
propose something that is a stepping stone toward a middle manager-less
world.
A related patch has been proposed by someone in the community:
https://github.com/apache/druid/pull/10910
I reviewed that patch and I don’t believe it handles any sort of
checkpointing. I also believe it ignores the task report. In addition, I
believe that running tasks as k8s jobs instead of pods will allow the k8s
scheduler to handle the lifecycle better. Right now, with the pod-based
approach, if the middle manager unexpectedly dies, those pods are not cleaned
up. While we can utilize some of that work, I think it is missing some key
features of the way Druid currently works.
### Operational impact
Should be none. There will be one or two configuration options to launch
tasks as k8s jobs.
### Future work
Remove the middle manager dependency completely. Right now the filesystem
is tightly coupled between the middle manager and tasks. Once we can
successfully launch tasks as k8s jobs from a middle manager, we can work on
removing the middle manager altogether.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]