churromorales commented on PR #13156:
URL: https://github.com/apache/druid/pull/13156#issuecomment-1278289541
I noticed a slight issue when testing on heavily loaded k8s clusters, i.e. clusters that take a while to spin pods up and down.
Currently the way things work:
1. The overlord launches a task and waits for the pod to come up. It not
only waits for a pod IP to be available, but also opens a socket connection to
the peon process's web server. Once the socket connection is established, the
task is considered started.
2. It then monitors the job until completion, which is a blocking call.
After the job completes, it pushes task reports and logs, and then deletes the
peon k8s job.
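To make the ordering concrete, here is a minimal, self-contained Java sketch of that flow. The class and method names are made up for illustration (they are not the actual Druid classes); the point is that the status only updates in the finally block, after all of the k8s teardown:

```java
// Illustrative sketch of the current flow: the status update (and therefore
// the interval-lock release) only happens in the finally block, after the
// report push and job deletion, which are slow on a loaded cluster.
class CurrentLifecycleSketch {
    static final StringBuilder log = new StringBuilder();

    static void waitForPodIpAndSocket() { log.append("started;"); }     // task considered started
    static void monitorJobBlocking()    { log.append("task-done;"); }   // blocks until the peon exits
    static void pushReportsAndLogs()    { log.append("reports;"); }     // slow k8s operation
    static void deletePeonJob()         { log.append("job-deleted;"); } // slow k8s operation

    static void runTask() {
        try {
            waitForPodIpAndSocket();
            monitorJobBlocking();
            pushReportsAndLogs();
            deletePeonJob();
        } finally {
            // only here does the task status update, releasing the interval lock
            log.append("lock-released");
        }
    }

    public static void main(String[] args) {
        runTask();
        System.out.println(log); // lock release trails the entire k8s teardown
    }
}
```

Even when the peon finishes in under a minute, "lock-released" can't happen until the slow k8s calls return.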
What was problematic here is that the process was complete and the task itself
was finished, but the k8s operations were slow on my cluster. The task itself
took less than a minute to complete, but the entire k8s lifecycle took much
longer, around 10 minutes (a very heavily loaded cluster). Because the task
status only updates in the finally block, the interval lock is not released
until the status updates. I have another patch I will test on our clusters; a
brief description of it:
1. The overlord launches a task and waits for the pod to come up. There is
no more opening a socket to the webserver. Instead, I added 2 TaskActions: one
to update the `TaskStatus` and one to update the `TaskLocation`.
2. Now, in `AbstractTask`, the setup method will update the task's own
location before the `run` method is called. If the overlord goes down for
whatever reason, that is fine; we don't really lose anything, the call fails
and the task itself dies.
3. In the cleanup method of `AbstractTask` we also update the
`TaskLocation` to `unknown` and update the status to `success` or `failure`,
so when the process exits we can give up the lock and things won't be blocked.
If the pod goes away during cleanup, that is okay; the overlord will still
monitor the job and report the correct results, albeit a bit slower.
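The patched flow can be sketched the same way. Again, the names below are hypothetical stand-ins (not the real TaskAction or `AbstractTask` code); what matters is that the peon reports its own location in setup and its own status in cleanup, so the lock can release before the overlord's slow k8s teardown:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the patched flow: the peon side reports its own
// location and final status via TaskAction-style calls, so the interval lock
// is released before the overlord's slow teardown (log push, job deletion).
class PatchedLifecycleSketch {
    static final List<String> events = new ArrayList<>();

    // Stand-ins for the two new TaskActions.
    static void updateTaskLocation(String loc) { events.add("location=" + loc); }
    static void updateTaskStatus(String st)    { events.add("status=" + st); }

    // Simplified AbstractTask-style template: setup -> run -> cleanup.
    static void runPeonSide() {
        updateTaskLocation("pod-ip:8100"); // setup(): report our own location
        events.add("run");                 // run(): the actual work
        updateTaskLocation("unknown");     // cleanup(): clear the location...
        updateTaskStatus("SUCCESS");       // ...and report status; lock can release now
    }

    // The overlord still monitors the job, but only for the slow teardown.
    static void overlordTeardown() {
        events.add("push-logs");
        events.add("delete-k8s-job");
    }

    public static void main(String[] args) {
        runPeonSide();
        overlordTeardown();
        System.out.println(String.join(";", events));
    }
}
```

The status update now precedes the k8s job deletion instead of trailing it, which is exactly the lock-holding fix described above.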
I thought about this approach originally but didn't want to make too many
changes to `core`. After running this patch on our clusters, though, I do
think this is the best approach. I will call it out in a separate commit so
you can review this work. It doesn't seem too invasive to me, but we should
focus on correctness here and not hold the locks longer than necessary, as we
were doing in the forking task runner.
I will test this out over the next day or so and then add the commit to this
PR.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]