[GitHub] [druid] mindreader opened a new issue, #13841: KubernetesTaskRunner gets out of sync with TaskQueue, stops running tasks (race condition)

via GitHub Thu, 23 Feb 2023 12:27:35 -0800


mindreader opened a new issue, #13841:
URL: https://github.com/apache/druid/issues/13841


   ### Affected Version
   
   25.0.0 - and I didn't see any patches since then that would affect this
   
   ### Description
   
   
   - Cluster info
   
   A pretty new cluster with 10 historicals, really just importing data right 
now in big batches, which is how the problem manifests
   
   - Configuration info
   
   Disabling zookeeper, using kubernetes leader election and mmless druid so 
that tasks are spawned in kubernetes, no middlemanager
         # most of the configs I'm using that might be related to this bug
   
         druid.zk.service.enabled=false
         druid.discovery.type=k8s
         druid.serverview.type=http
         druid.coordinator.loadqueuepeon.type=http
         druid.discovery.k8s.clusterIdentifier=${env:CLUSTER_ID}
   
         druid.storage.type=s3
         druid.storage.bucket=${env:AWS_BUCKET}
         druid.storage.baseKey=${env:STORAGE_BASE_KEY}
         druid.storage.archiveBaseKey=${env:STORAGE_ARCHIVE_BASE_KEY}
         druid.storage.disableAcl=true
         druid.s3.endpoint.url=${env:AWS_S3_ENDPOINT_URL}
         druid.s3.forceGlobalBucketAccessEnabled=true
   
         druid.extensions.loadList=["druid-basic-security", 
"druid-s3-extensions","druid-kafka-indexing-service","druid-datasketches","postgresql-metadata-storage",
 "druid-kubernetes-extensions", "druid-kubernetes-overlord-extensions", 
"druid-parquet-extensions", "druid-multi-stage-query", "druid-stats", 
"druid-time-min-max"]
   
         druid.indexer.runner.javaOptsArray=["-server", "-Xms${TASK_MEMORY}", 
"-Xmx${TASK_MEMORY}", "-XX:MaxDirectMemorySize=${TASK_MEMORY}", 
"-Duser.timezone=UTC", "-Dfile.encoding=UTF-8", "-XX:+ExitOnOutOfMemoryError", 
"-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager"]
         druid.indexer.runner.debugJobs=false
         druid.indexer.runner.maxTaskDuration=PT96H # was attempting multiple 
day batch tasks, but reliability is not quite there
         druid.indexer.runner.taskCleanupDelay=P1D
         druid.indexer.runner.graceTerminationPeriodSeconds=PT30S
         druid.indexer.queue.maxSize=${OVERLORD_MAX_TASK_SLOTS}
   
         # TODO disable middlemanagers in favor of k8s tasks
         druid.indexer.runner.type=k8s
         druid.indexer.runner.namespace=${KUBE_NAMESPACE}
         druid.indexer.storage.type=metadata
         druid.indexer.task.encapsulatedTask=true
   
         druid.indexer.logs.type=s3
         druid.indexer.logs.s3Bucket=${env:AWS_BUCKET}
         druid.indexer.logs.s3Prefix=${env:STORAGE_INDEXER_LOGS_BASE_KEY}
         druid.indexer.logs.disableAcl=true
   
         druid.processing.intermediaryData.storage.type=deepstore
         druid.coordinator.asOverlord.enabled=true
   
   - Steps to reproduce the problem
   
     Set druid.indexer.queue.maxSize to something low, like 25. Then do a bit 
native batch parallel index import that involves hundreds of subtasks. 
Everything will work for awhile, with 25 tasks going simultaneously, but soon 
new tasks will begin to be stuck in a WAITING status, and there will be no 
corresponding job in k8s for it. As time goes on more tasks will go straight 
into WAITING and will not spawn. Eventually there will be no more running tasks 
in the system and all tasks (all 25 of them) that attempt to run will be in 
waiting. You will have to restart your coordinator to get tasks running again.
   
   - The error message or stack traces encountered. Providing more context, 
such as nearby log messages or even entire logs, can be helpful.
   Here is a log that shows the problem. I added extra logs to narrow down 
where the problem is. Explanation of the bug after the log.
   
   ```
   2023-02-18T16:29:06,342 INFO [qtp1082073129-150] 
org.apache.druid.indexing.overlord.MetadataTaskStorage - Inserting task 
partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z 
with status: 
TaskStatus{id=partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z,
 status=RUNNING, duration=-1, errorMsg=null}
   2023-02-18T16:29:06,347 INFO [qtp1082073129-150] 
org.apache.druid.indexing.overlord.TaskLockbox - Adding 
task[partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z]
 to activeTasks
   2023-02-18T16:29:06,630 INFO [TaskQueue-Manager] 
org.apache.druid.indexing.overlord.TaskQueue - Asking taskRunner to run: 
partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z
   2023-02-18T16:29:06,633 INFO [TaskQueue-Manager] 
org.apache.druid.k8s.overlord.KubernetesTaskRunner - placed task 
partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z 
into before 11 after 12
   2023-02-18T16:29:06,633 INFO [k8s-task-runner-11] 
org.apache.druid.k8s.overlord.KubernetesTaskRunner - preparing to run k8s task 
[ 
partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z,
 partialdimensiondistributiondownloadstechjjgcgced20230218t16290]
   2023-02-18T16:29:06,694 INFO [k8s-task-runner-11] 
org.apache.druid.k8s.overlord.KubernetesTaskRunner - existing job was not 
present [ 
partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z,
 partialdimensiondistributiondownloadstechjjgcgced20230218t16290]
   2023-02-18T16:29:06,694 INFO [k8s-task-runner-11] 
org.apache.druid.k8s.overlord.KubernetesTaskRunner - Peon Command for K8s job: 
/peon.sh 
var/druid/task/partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z
 1
   2023-02-18T16:29:06,819 INFO [k8s-task-runner-11] 
org.apache.druid.k8s.overlord.KubernetesTaskRunner - Job created [ 
partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z,
 partialdimensiondistributiondownloadstechjjgcgced20230218t16290] and ready to 
launch
   2023-02-18T16:29:06,886 INFO [k8s-task-runner-11] 
org.apache.druid.k8s.overlord.common.DruidKubernetesPeonClient - Successfully 
submitted job: [ 
partialdimensiondistributiondownloadstechjjgcgced20230218t16290, 
partialdimensiondistributiondownloadstechjjgcgced20230218t16290] ... waiting 
for job to launch
   2023-02-18T16:29:21,264 INFO [k8s-task-runner-11] 
org.apache.druid.k8s.overlord.common.DruidKubernetesPeonClient - Took task [ 
partialdimensiondistributiondownloadstechjjgcgced20230218t16290, 
partialdimensiondistributiondownloadstechjjgcgced20230218t16290] 14444 ms for 
pod to startup
   2023-02-18T16:29:21,266 INFO [k8s-task-runner-11] 
org.apache.druid.k8s.overlord.KubernetesTaskRunner - Job [ 
partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z,
 partialdimensiondistributiondownloadstechjjgcgced20230218t16290] launched in 
k8s, monitoring
   
   #  Task / job is running at this point, then it finishes normally
   
   2023-02-18T16:35:00,252 INFO [k8s-task-runner-11] 
org.apache.druid.k8s.overlord.KubernetesTaskRunner - finished waiting for task 
[ 
partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z,
 partialdimensiondistributiondownloadstechjjgcgced20230218t16290] phase 
SUCCEEDED
   2023-02-18T16:35:00,258 INFO [k8s-task-runner-11] 
org.apache.druid.k8s.overlord.common.JobResponse - Duration for Job: 
partialdimensiondistributiondownloadstechjjgcgced20230218t16290 was 354000 
seconds
   2023-02-18T16:35:00,258 INFO [k8s-task-runner-11] 
org.apache.druid.k8s.overlord.common.JobResponse - Duration for Job: 
partialdimensiondistributiondownloadstechjjgcgced20230218t16290 was 354000 
seconds
   2023-02-18T16:35:00,258 INFO [k8s-task-runner-11] 
org.apache.druid.k8s.overlord.KubernetesTaskRunner - cleaning up task 
partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z
   2023-02-18T16:35:00,391 INFO [k8s-task-runner-11] 
org.apache.druid.storage.s3.S3TaskLogs - Pushing task log 
/tmp/partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z11032330796277621083log
 to: 
druid/indexer_logs/partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z/log
   2023-02-18T16:35:00,393 INFO [k8s-task-runner-11] 
org.apache.druid.storage.s3.S3Utils - Pushing 
[/tmp/partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z11032330796277621083log]
 to bucket[simplecast-analytics-records] and 
key[druid/indexer_logs/partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z/log].
   2023-02-18T16:35:01,267 INFO [k8s-task-runner-11] 
org.apache.druid.k8s.overlord.common.DruidKubernetesPeonClient - Cleaned up k8s 
task: [ 
partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z,
 partialdimensiondistributiondownloadstechjjgcgced20230218t16290]
   2023-02-18T16:35:01,267 INFO [k8s-task-runner-11] 
org.apache.druid.k8s.overlord.KubernetesTaskRunner - about to clean up task 
partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z 
task tasks len 20
   2023-02-18T16:35:01,267 INFO [k8s-task-runner-11] 
org.apache.druid.k8s.overlord.KubernetesTaskRunner - just cleaned up up task 
partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z 
task len 19
   2023-02-18T16:35:01,267 INFO [k8s-task-runner-11] 
org.apache.druid.indexing.overlord.TaskQueue - Received SUCCESS status for 
task: 
partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z
   2023-02-18T16:35:01,268 INFO [k8s-task-runner-11] 
org.apache.druid.indexing.overlord.MetadataTaskStorage - Updating task 
partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z 
to status: 
TaskStatus{id=partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z,
 status=SUCCESS, duration=354000, errorMsg=null}
   
   # Cleanup has finishes and we are good to go, but then oops, qtp reruns the 
same job that was just running just 50ms after we finished cleaning up. End 
result TaskQueue has 19, KTR has 20. TaskQueue will try to spawn another job 
now, KTR will do so if it has spare threads, but if the limit was 20, it will 
fail to do so from here on out.
   
   2023-02-18T16:35:01,308 INFO [qtp1082073129-146] 
org.apache.druid.k8s.overlord.KubernetesTaskRunner - placed task 
partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z 
into before 19 after 20
   2023-02-18T16:35:01,308 INFO [k8s-task-runner-20] 
org.apache.druid.k8s.overlord.KubernetesTaskRunner - preparing to run k8s task 
[ 
partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z,
 partialdimensiondistributiondownloadstechjjgcgced20230218t16290]
   2023-02-18T16:35:01,377 INFO [k8s-task-runner-11] 
org.apache.druid.k8s.overlord.common.DruidKubernetesPeonClient - Failed to 
cleanup task: [ 
partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z,
 partialdimensiondistributiondownloadstechjjgcgced20230218t16290]
   2023-02-18T16:35:01,377 INFO [k8s-task-runner-11] 
org.apache.druid.indexing.overlord.TaskLockbox - Removing 
task[partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z]
 from activeTasks
   2023-02-18T16:35:01,378 INFO [k8s-task-runner-11] 
org.apache.druid.indexing.overlord.TaskQueue - Task SUCCESS: 
partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z 
(354000 run duration)
   2023-02-18T16:35:01,403 INFO [k8s-task-runner-20] 
org.apache.druid.k8s.overlord.KubernetesTaskRunner - existing job was not 
present [ 
partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z,
 partialdimensiondistributiondownloadstechjjgcgced20230218t16290]
   2023-02-18T16:35:01,404 INFO [k8s-task-runner-20] 
org.apache.druid.k8s.overlord.KubernetesTaskRunner - Peon Command for K8s job: 
/peon.sh 
var/druid/task/partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z
 1
   2023-02-18T16:35:01,461 INFO [k8s-task-runner-20] 
org.apache.druid.k8s.overlord.KubernetesTaskRunner - Job created [ 
partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z,
 partialdimensiondistributiondownloadstechjjgcgced20230218t16290] and ready to 
launch
   2023-02-18T16:35:01,525 INFO [k8s-task-runner-20] 
org.apache.druid.k8s.overlord.common.DruidKubernetesPeonClient - Successfully 
submitted job: [ 
partialdimensiondistributiondownloadstechjjgcgced20230218t16290, 
partialdimensiondistributiondownloadstechjjgcgced20230218t16290] ... waiting 
for job to launch
   2023-02-18T16:35:03,290 INFO [k8s-task-runner-20] 
org.apache.druid.k8s.overlord.common.DruidKubernetesPeonClient - Took task [ 
partialdimensiondistributiondownloadstechjjgcgced20230218t16290, 
partialdimensiondistributiondownloadstechjjgcgced20230218t16290] 1829 ms for 
pod to startup
   2023-02-18T16:35:03,306 INFO [k8s-task-runner-20] 
org.apache.druid.k8s.overlord.KubernetesTaskRunner - Job [ 
partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z,
 partialdimensiondistributiondownloadstechjjgcgced20230218t16290] launched in 
k8s, monitoring
   
   # Something comes along and kills the task directly. At this point KTR 
thinks it is running, TaskQueue knows it is successful and dead, they are now 
out of sync by one.
   
   2023-02-18T16:35:06,643 INFO [TaskQueue-Manager] 
org.apache.druid.k8s.overlord.common.DruidKubernetesPeonClient - Cleaned up k8s 
task: [ 
partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z,
 partialdimensiondistributiondownloadstechjjgcgced20230218t16290]
   ```
   
   - Any debugging that you have already done
   
   What happens is there is some process that spams KubernetesTaskRunner (aka 
KTR) constantly, about three times a second with every job that it wants to run 
(so if a parallel index has 20 subtasks, those 20 tasks will be constantly 
spammed even if they are alerady running). Most of the time this is fine 
because KTR has its own internal synchronized `tasks` variable that checks to 
see if the task is already in its list and if it is, it just returns it.
   
   When a task finishes, KTR will do a bunch of cleanup, remove the task from 
the list, upload logs, etc, and kill the task. Unfortunately right as it 
finishes there is a good chance that the thread that is spamming jobs 
(qtp1082073129-146 which is I guess the parallel index process?) will call run 
after it has been removed from tasks, placing it right back into the list. KTR 
then just spawns it again. At some point the TaskQueueManager? notices a task 
running that was already successful so it just kills it, but nothing ever 
removes it from the list of tasks KTR has as its state.
   
   So the TaskQueue has a list of 25 tasks, KTR has a list of 25 tasks, but 
then a task ends, both go to 24, then KTR gets the task that just ended again 
going back to 25, so now we have 24 vs 25. The TaskQueue tries to spawn another 
task.
   
   At this point KTR has a thread pool that is queue.maxSize big. So it can 
only spawn 25 tasks and until another task ends it cannot spawn a 26th. 
Eventually one task will complete, it will take the next WAITING task and spawn 
that. Eventually the bug keeps occurring and there are more and more ghost 
tasks in its list until every thread is taken up and it won't spawn anymore 
tasks. I expect that after druid.indexer.runner.maxTaskDuration (which is 
abnormally high in my case), KTR's monitor will expire and it will either drop 
the task quietly or check its status, not really sure, so if your task duration 
limit is low and your task limit is high, you might not notice this bug.
   
   I ended up masking this behavior in my cluster by doing the following, which 
is obviously not a permanent fix but it got me going.
   
   ```diff
   --- 
a/extensions-contrib/kubernetes-overlord-extensions/src/main/java/org/apache/druid/k8s/overlord/KubernetesTaskRunner.java
   +++ 
b/extensions-contrib/kubernetes-overlord-extensions/src/main/java/org/apache/druid/k8s/overlord/KubernetesTaskRunner.java
   @@ -134,7 +134,7 @@ public class KubernetesTaskRunner implements 
TaskLogStreamer, TaskRunner
        this.node = node;
        this.cleanupExecutor = Executors.newScheduledThreadPool(1);
        this.exec = MoreExecutors.listeningDecorator(
   -        Execs.multiThreaded(taskQueueConfig.getMaxSize(), 
"k8s-task-runner-%d")
   +        Execs.multiThreaded(taskQueueConfig.getMaxSize() * 50, 
"k8s-task-runner-%d")
        );
        Preconditions.checkArgument(
            taskQueueConfig.getMaxSize() < Integer.MAX_VALUE,
   ```
   
    I'm not quite knowledgeable enough about druid to know what the right fix 
is as KTR can't query back to the TaskQueue to find out whether the task it is 
about to run should be run, nor should it. Also not sure why the parallel index 
spams this class so hard to create new jobs that it should already know are 
running, not sure if that is intended or if I have something misconfigured. 
This bug would sure manifest a lot less often if it didn't do that.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [druid] mindreader opened a new issue, #13841: KubernetesTaskRunner gets out of sync with TaskQueue, stops running tasks (race condition)

Reply via email to