mindreader opened a new issue, #13841:
URL: https://github.com/apache/druid/issues/13841
### Affected Version
25.0.0 - and I didn't see any patches since then that would affect this
### Description
- Cluster info
A pretty new cluster with 10 historicals, really just importing data right
now in big batches, which is how the problem manifests
- Configuration info
ZooKeeper is disabled; the cluster uses Kubernetes leader election and middle-manager-less ("mmless") Druid, so tasks are spawned as Kubernetes jobs and there is no MiddleManager.
```
# most of the configs I'm using that might be related to this bug
druid.zk.service.enabled=false
druid.discovery.type=k8s
druid.serverview.type=http
druid.coordinator.loadqueuepeon.type=http
druid.discovery.k8s.clusterIdentifier=${env:CLUSTER_ID}
druid.storage.type=s3
druid.storage.bucket=${env:AWS_BUCKET}
druid.storage.baseKey=${env:STORAGE_BASE_KEY}
druid.storage.archiveBaseKey=${env:STORAGE_ARCHIVE_BASE_KEY}
druid.storage.disableAcl=true
druid.s3.endpoint.url=${env:AWS_S3_ENDPOINT_URL}
druid.s3.forceGlobalBucketAccessEnabled=true
druid.extensions.loadList=["druid-basic-security", "druid-s3-extensions", "druid-kafka-indexing-service", "druid-datasketches", "postgresql-metadata-storage", "druid-kubernetes-extensions", "druid-kubernetes-overlord-extensions", "druid-parquet-extensions", "druid-multi-stage-query", "druid-stats", "druid-time-min-max"]
druid.indexer.runner.javaOptsArray=["-server", "-Xms${TASK_MEMORY}", "-Xmx${TASK_MEMORY}", "-XX:MaxDirectMemorySize=${TASK_MEMORY}", "-Duser.timezone=UTC", "-Dfile.encoding=UTF-8", "-XX:+ExitOnOutOfMemoryError", "-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager"]
druid.indexer.runner.debugJobs=false
druid.indexer.runner.maxTaskDuration=PT96H # was attempting multiple day batch tasks, but reliability is not quite there
druid.indexer.runner.taskCleanupDelay=P1D
druid.indexer.runner.graceTerminationPeriodSeconds=PT30S
druid.indexer.queue.maxSize=${OVERLORD_MAX_TASK_SLOTS}
# TODO disable middlemanagers in favor of k8s tasks
druid.indexer.runner.type=k8s
druid.indexer.runner.namespace=${KUBE_NAMESPACE}
druid.indexer.storage.type=metadata
druid.indexer.task.encapsulatedTask=true
druid.indexer.logs.type=s3
druid.indexer.logs.s3Bucket=${env:AWS_BUCKET}
druid.indexer.logs.s3Prefix=${env:STORAGE_INDEXER_LOGS_BASE_KEY}
druid.indexer.logs.disableAcl=true
druid.processing.intermediaryData.storage.type=deepstore
druid.coordinator.asOverlord.enabled=true
```
- Steps to reproduce the problem
Set druid.indexer.queue.maxSize to something low, like 25. Then run a big native batch parallel index import that involves hundreds of subtasks. Everything will work for a while, with 25 tasks running simultaneously, but soon new tasks will begin to get stuck in a WAITING status with no corresponding job in k8s. As time goes on, more and more tasks will go straight into WAITING and never spawn. Eventually there will be no running tasks in the system at all, and every task that attempts to run (all 25 of them) will be stuck WAITING. You have to restart your coordinator to get tasks running again.
- The error message or stack traces encountered. Providing more context,
such as nearby log messages or even entire logs, can be helpful.
Here is a log that shows the problem. I added extra logs to narrow down
where the problem is. Explanation of the bug after the log.
```
2023-02-18T16:29:06,342 INFO [qtp1082073129-150]
org.apache.druid.indexing.overlord.MetadataTaskStorage - Inserting task
partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z
with status:
TaskStatus{id=partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z,
status=RUNNING, duration=-1, errorMsg=null}
2023-02-18T16:29:06,347 INFO [qtp1082073129-150]
org.apache.druid.indexing.overlord.TaskLockbox - Adding
task[partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z]
to activeTasks
2023-02-18T16:29:06,630 INFO [TaskQueue-Manager]
org.apache.druid.indexing.overlord.TaskQueue - Asking taskRunner to run:
partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z
2023-02-18T16:29:06,633 INFO [TaskQueue-Manager]
org.apache.druid.k8s.overlord.KubernetesTaskRunner - placed task
partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z
into before 11 after 12
2023-02-18T16:29:06,633 INFO [k8s-task-runner-11]
org.apache.druid.k8s.overlord.KubernetesTaskRunner - preparing to run k8s task
[
partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z,
partialdimensiondistributiondownloadstechjjgcgced20230218t16290]
2023-02-18T16:29:06,694 INFO [k8s-task-runner-11]
org.apache.druid.k8s.overlord.KubernetesTaskRunner - existing job was not
present [
partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z,
partialdimensiondistributiondownloadstechjjgcgced20230218t16290]
2023-02-18T16:29:06,694 INFO [k8s-task-runner-11]
org.apache.druid.k8s.overlord.KubernetesTaskRunner - Peon Command for K8s job:
/peon.sh
var/druid/task/partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z
1
2023-02-18T16:29:06,819 INFO [k8s-task-runner-11]
org.apache.druid.k8s.overlord.KubernetesTaskRunner - Job created [
partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z,
partialdimensiondistributiondownloadstechjjgcgced20230218t16290] and ready to
launch
2023-02-18T16:29:06,886 INFO [k8s-task-runner-11]
org.apache.druid.k8s.overlord.common.DruidKubernetesPeonClient - Successfully
submitted job: [
partialdimensiondistributiondownloadstechjjgcgced20230218t16290,
partialdimensiondistributiondownloadstechjjgcgced20230218t16290] ... waiting
for job to launch
2023-02-18T16:29:21,264 INFO [k8s-task-runner-11]
org.apache.druid.k8s.overlord.common.DruidKubernetesPeonClient - Took task [
partialdimensiondistributiondownloadstechjjgcgced20230218t16290,
partialdimensiondistributiondownloadstechjjgcgced20230218t16290] 14444 ms for
pod to startup
2023-02-18T16:29:21,266 INFO [k8s-task-runner-11]
org.apache.druid.k8s.overlord.KubernetesTaskRunner - Job [
partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z,
partialdimensiondistributiondownloadstechjjgcgced20230218t16290] launched in
k8s, monitoring
# Task / job is running at this point, then it finishes normally
2023-02-18T16:35:00,252 INFO [k8s-task-runner-11]
org.apache.druid.k8s.overlord.KubernetesTaskRunner - finished waiting for task
[
partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z,
partialdimensiondistributiondownloadstechjjgcgced20230218t16290] phase
SUCCEEDED
2023-02-18T16:35:00,258 INFO [k8s-task-runner-11]
org.apache.druid.k8s.overlord.common.JobResponse - Duration for Job:
partialdimensiondistributiondownloadstechjjgcgced20230218t16290 was 354000
seconds
2023-02-18T16:35:00,258 INFO [k8s-task-runner-11]
org.apache.druid.k8s.overlord.common.JobResponse - Duration for Job:
partialdimensiondistributiondownloadstechjjgcgced20230218t16290 was 354000
seconds
2023-02-18T16:35:00,258 INFO [k8s-task-runner-11]
org.apache.druid.k8s.overlord.KubernetesTaskRunner - cleaning up task
partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z
2023-02-18T16:35:00,391 INFO [k8s-task-runner-11]
org.apache.druid.storage.s3.S3TaskLogs - Pushing task log
/tmp/partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z11032330796277621083log
to:
druid/indexer_logs/partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z/log
2023-02-18T16:35:00,393 INFO [k8s-task-runner-11]
org.apache.druid.storage.s3.S3Utils - Pushing
[/tmp/partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z11032330796277621083log]
to bucket[simplecast-analytics-records] and
key[druid/indexer_logs/partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z/log].
2023-02-18T16:35:01,267 INFO [k8s-task-runner-11]
org.apache.druid.k8s.overlord.common.DruidKubernetesPeonClient - Cleaned up k8s
task: [
partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z,
partialdimensiondistributiondownloadstechjjgcgced20230218t16290]
2023-02-18T16:35:01,267 INFO [k8s-task-runner-11]
org.apache.druid.k8s.overlord.KubernetesTaskRunner - about to clean up task
partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z
task tasks len 20
2023-02-18T16:35:01,267 INFO [k8s-task-runner-11]
org.apache.druid.k8s.overlord.KubernetesTaskRunner - just cleaned up up task
partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z
task len 19
2023-02-18T16:35:01,267 INFO [k8s-task-runner-11]
org.apache.druid.indexing.overlord.TaskQueue - Received SUCCESS status for
task:
partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z
2023-02-18T16:35:01,268 INFO [k8s-task-runner-11]
org.apache.druid.indexing.overlord.MetadataTaskStorage - Updating task
partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z
to status:
TaskStatus{id=partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z,
status=SUCCESS, duration=354000, errorMsg=null}
# Cleanup has finished and we are good to go, but then, oops, a qtp thread
# reruns the same job just 50ms after cleanup completed. End result: the
# TaskQueue has 19 tasks, KTR has 20. The TaskQueue will try to spawn another
# task now; KTR will do so if it has spare threads, but if the limit was 20,
# it will fail to do so from here on out.
2023-02-18T16:35:01,308 INFO [qtp1082073129-146]
org.apache.druid.k8s.overlord.KubernetesTaskRunner - placed task
partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z
into before 19 after 20
2023-02-18T16:35:01,308 INFO [k8s-task-runner-20]
org.apache.druid.k8s.overlord.KubernetesTaskRunner - preparing to run k8s task
[
partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z,
partialdimensiondistributiondownloadstechjjgcgced20230218t16290]
2023-02-18T16:35:01,377 INFO [k8s-task-runner-11]
org.apache.druid.k8s.overlord.common.DruidKubernetesPeonClient - Failed to
cleanup task: [
partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z,
partialdimensiondistributiondownloadstechjjgcgced20230218t16290]
2023-02-18T16:35:01,377 INFO [k8s-task-runner-11]
org.apache.druid.indexing.overlord.TaskLockbox - Removing
task[partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z]
from activeTasks
2023-02-18T16:35:01,378 INFO [k8s-task-runner-11]
org.apache.druid.indexing.overlord.TaskQueue - Task SUCCESS:
partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z
(354000 run duration)
2023-02-18T16:35:01,403 INFO [k8s-task-runner-20]
org.apache.druid.k8s.overlord.KubernetesTaskRunner - existing job was not
present [
partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z,
partialdimensiondistributiondownloadstechjjgcgced20230218t16290]
2023-02-18T16:35:01,404 INFO [k8s-task-runner-20]
org.apache.druid.k8s.overlord.KubernetesTaskRunner - Peon Command for K8s job:
/peon.sh
var/druid/task/partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z
1
2023-02-18T16:35:01,461 INFO [k8s-task-runner-20]
org.apache.druid.k8s.overlord.KubernetesTaskRunner - Job created [
partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z,
partialdimensiondistributiondownloadstechjjgcgced20230218t16290] and ready to
launch
2023-02-18T16:35:01,525 INFO [k8s-task-runner-20]
org.apache.druid.k8s.overlord.common.DruidKubernetesPeonClient - Successfully
submitted job: [
partialdimensiondistributiondownloadstechjjgcgced20230218t16290,
partialdimensiondistributiondownloadstechjjgcgced20230218t16290] ... waiting
for job to launch
2023-02-18T16:35:03,290 INFO [k8s-task-runner-20]
org.apache.druid.k8s.overlord.common.DruidKubernetesPeonClient - Took task [
partialdimensiondistributiondownloadstechjjgcgced20230218t16290,
partialdimensiondistributiondownloadstechjjgcgced20230218t16290] 1829 ms for
pod to startup
2023-02-18T16:35:03,306 INFO [k8s-task-runner-20]
org.apache.druid.k8s.overlord.KubernetesTaskRunner - Job [
partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z,
partialdimensiondistributiondownloadstechjjgcgced20230218t16290] launched in
k8s, monitoring
# Something comes along and kills the task directly. At this point KTR
thinks it is running, TaskQueue knows it is successful and dead, they are now
out of sync by one.
2023-02-18T16:35:06,643 INFO [TaskQueue-Manager]
org.apache.druid.k8s.overlord.common.DruidKubernetesPeonClient - Cleaned up k8s
task: [
partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z,
partialdimensiondistributiondownloadstechjjgcgced20230218t16290]
```
- Any debugging that you have already done
What happens is that some process spams KubernetesTaskRunner (aka KTR) constantly, about three times a second, with every job that it wants to run (so if a parallel index has 20 subtasks, those 20 tasks are re-submitted constantly even if they are already running). Most of the time this is fine, because KTR keeps its own internal synchronized `tasks` map: it checks whether the task is already in its list and, if so, just returns the existing entry.
    When a task finishes, KTR does a bunch of cleanup: it removes the task from the list, uploads logs, etc., and kills the k8s job. Unfortunately, right as it finishes there is a good chance that the thread spamming jobs (qtp1082073129-146, which I guess is the parallel index supervisor?) will call run() after the task has been removed from `tasks`, placing it right back into the list. KTR then just spawns it again. At some point the TaskQueue manager notices a task running that was already successful, so it kills it, but nothing ever removes it from the list of tasks that KTR holds as its state.
So the TaskQueue has a list of 25 tasks and KTR has a list of 25 tasks. A task ends and both drop to 24; then KTR receives the just-ended task again and goes back to 25, so now we have 24 vs 25, and the TaskQueue tries to spawn another task.
    At this point the problem is that KTR's thread pool is only queue.maxSize big, so it can run at most 25 tasks, and until another task ends it cannot spawn a 26th. Eventually one task completes, and KTR picks up the next WAITING task and spawns it. But the bug keeps occurring, leaving more and more ghost tasks in KTR's list, until every thread is taken up and it won't spawn any more tasks. I expect that after druid.indexer.runner.maxTaskDuration (which is abnormally high in my case) KTR's monitor will expire and it will either drop the task quietly or check its status, I'm not really sure; so if your task duration limit is low and your task limit is high, you might never notice this bug.
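To make the leak concrete, here is a minimal, self-contained sketch of the accounting described above. The names (`GhostTaskSketch`, etc.) are mine, not Druid's, and this deliberately ignores threads and k8s; it only models the slot bookkeeping: a finished task frees its slot, a late duplicate run() call re-claims it, and from then on one slot is permanently held by a ghost.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of the race (not Druid code): a runner with a task map
// and a worker-slot budget equal to queue.maxSize.
class GhostTaskSketch {
    final int maxSize;
    final Map<String, Boolean> tasks = new ConcurrentHashMap<>();

    GhostTaskSketch(int maxSize) { this.maxSize = maxSize; }

    // Mirrors the described run() behavior: return the existing entry if the
    // task is already tracked, otherwise claim a free slot.
    boolean run(String taskId) {
        if (tasks.containsKey(taskId)) return true;  // already tracked, fine
        if (tasks.size() >= maxSize) return false;   // no free slot: stuck WAITING
        tasks.put(taskId, true);
        return true;
    }

    // Mirrors cleanup after a task finishes: the slot is freed...
    void cleanup(String taskId) { tasks.remove(taskId); }
    // ...but nothing prevents a late run(taskId) from re-claiming it.
}
```

Replaying the sequence from the log with maxSize=3 shows the squeeze: after cleanup of t1, a late duplicate run("t1") re-inserts the ghost, and a genuinely new task can never get a slot.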
I ended up masking this behavior in my cluster by doing the following, which
is obviously not a permanent fix but it got me going.
```diff
--- a/extensions-contrib/kubernetes-overlord-extensions/src/main/java/org/apache/druid/k8s/overlord/KubernetesTaskRunner.java
+++ b/extensions-contrib/kubernetes-overlord-extensions/src/main/java/org/apache/druid/k8s/overlord/KubernetesTaskRunner.java
@@ -134,7 +134,7 @@ public class KubernetesTaskRunner implements TaskLogStreamer, TaskRunner
     this.node = node;
     this.cleanupExecutor = Executors.newScheduledThreadPool(1);
     this.exec = MoreExecutors.listeningDecorator(
-        Execs.multiThreaded(taskQueueConfig.getMaxSize(), "k8s-task-runner-%d")
+        Execs.multiThreaded(taskQueueConfig.getMaxSize() * 50, "k8s-task-runner-%d")
     );
     Preconditions.checkArgument(
         taskQueueConfig.getMaxSize() < Integer.MAX_VALUE,
```
I'm not quite knowledgeable enough about druid to know what the right fix is, since KTR can't query back to the TaskQueue to find out whether the task it is about to run should actually be run, nor should it. I'm also not sure why the parallel index spams this class so hard, creating new jobs that it should already know are running; I don't know whether that is intended or whether I have something misconfigured. This bug would manifest a lot less often if it didn't do that.
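One possible direction, purely illustrative and not an existing Druid API: instead of widening the pool, the runner could remember recently completed task IDs and refuse a late duplicate run() for them. The class and method names below are hypothetical; a real fix would also need to bound or expire the set.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical guard (not Druid code): remember recently completed task IDs
// so a late, duplicated run() call cannot resurrect a finished task.
class CompletedTaskGuard {
    private final Set<String> recentlyCompleted = ConcurrentHashMap.newKeySet();

    // Called from the cleanup path once a task's final status is recorded.
    void markCompleted(String taskId) { recentlyCompleted.add(taskId); }

    // Called at the top of run(): only tasks not already completed may start.
    boolean shouldRun(String taskId) { return !recentlyCompleted.contains(taskId); }
}
```

With such a guard, the re-submission spam would hit `shouldRun` and be dropped instead of re-entering the `tasks` map, which is what turns duplicates into ghost slots.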
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]