sohansamant8 opened a new issue #11734:
URL: https://github.com/apache/druid/issues/11734
Hi,
My Kafka ingestion tasks are failing without any error details. I submit a
supervisor spec, and within less than 12 seconds the task goes from RUNNING to
FAILED. The supervisor then keeps creating new tasks, which also fail.
I have two MiddleManager pods and two Historical nodes running. Three
ingestion jobs are currently running fine, but as soon as we add a fourth, its
tasks start failing and the supervisor goes unhealthy.
Druid version: 0.21.0
Deployed on GCP Kubernetes Engine (GKE).
Any help on this is highly appreciated.
The config details are below:
**brokers:**

    # Optionally specify for running broker as Deployment
    kind: StatefulSet
    nodeType: "broker"
    druid.port: {{ .Values.application.druid.port }}
    nodeConfigMountPath: "/opt/druid/conf/druid/cluster/query/broker"
    podDisruptionBudgetSpec:
      maxUnavailable: 1
    replicas: {{ .Values.application.druid.brokers.replicaCount }}
    readinessProbe:
      initialDelaySeconds: 60
      periodSeconds: 5
      failureThreshold: 3
      httpGet:
        path: /druid/broker/v1/readiness
        port: 8088
    env:
      - name: GOOGLE_APPLICATION_CREDENTIALS
        value: /var/secrets/google/key.json
      - name: DRUID_XMS
        value: 512m
      - name: DRUID_XMX
        value: 2048m
    runtime.properties: |
      druid.service=druid/broker
      druid.broker.http.numConnections=20
      druid.server.http.numThreads=5
      druid.processing.buffer.sizeBytes=268435456
      druid.processing.numMergeBuffers=2
      druid.processing.numThreads=1
      druid.sql.enable=true
    extra.jvm.options: |-
      -Xmx512M
      -Xms512M
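As a side note, the direct-memory requirement Druid documents for a process is `sizeBytes * (numMergeBuffers + numThreads + 1)`; whether this matters for the failures here is not confirmed, but the broker settings above work out as follows:

```python
# Druid's documented direct-memory sizing for a process:
#   sizeBytes * (numMergeBuffers + numThreads + 1)
# Plugging in the broker runtime.properties above:

size_bytes = 268435456        # druid.processing.buffer.sizeBytes (256 MiB)
num_merge_buffers = 2         # druid.processing.numMergeBuffers
num_threads = 1               # druid.processing.numThreads

direct_needed = size_bytes * (num_merge_buffers + num_threads + 1)
print(direct_needed // 2**20)  # 1024 MiB of direct memory required
```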
**coordinators:**

    # Optionally specify for running coordinator as Deployment
    kind: StatefulSet
    nodeType: "coordinator"
    druid.port: {{ .Values.application.druid.port }}
    nodeConfigMountPath: "/opt/druid/conf/druid/cluster/master/coordinator-overlord"
    podDisruptionBudgetSpec:
      maxUnavailable: 1
    replicas: {{ .Values.application.druid.coordinators.replicaCount }}
    livenessProbe:
      initialDelaySeconds: 60
      periodSeconds: 5
      failureThreshold: 3
      httpGet:
        path: /status/health
        port: {{ .Values.application.druid.port }}
    readinessProbe:
      initialDelaySeconds: 60
      periodSeconds: 5
      failureThreshold: 3
      httpGet:
        path: /status/health
        port: {{ .Values.application.druid.port }}
    env:
      - name: GOOGLE_APPLICATION_CREDENTIALS
        value: /var/secrets/google/key.json
      - name: DRUID_XMS
        value: 1g
      - name: DRUID_XMX
        value: 2048m
    runtime.properties: |
      druid.service=druid/coordinator
      druid.coordinator.startDelay=PT10S
      druid.coordinator.period=PT5S
      druid.coordinator.asOverlord.enabled=true
      druid.coordinator.asOverlord.overlordService=druid/overlord
      druid.indexer.queue.startDelay=PT30S
      druid.indexer.runner.type=remote
      druid.indexer.storage.type=metadata
      druid.indexer.runner.pendingTasksRunnerNumThreads=8
      druid.coordinator.maxNumConcurrentSubTasks=5
    extra.jvm.options: |-
      -Xmx512M
      -Xms512M
**historicals:**

    kind: StatefulSet
    nodeType: "historical"
    druid.port: {{ .Values.application.druid.port }}
    nodeConfigMountPath: "/opt/druid/conf/druid/cluster/data/historical"
    podDisruptionBudgetSpec:
      maxUnavailable: 1
    replicas: {{ .Values.application.druid.historicals.replicaCount }}
    livenessProbe:
      initialDelaySeconds: 1800
      periodSeconds: 5
      failureThreshold: 3
      httpGet:
        path: /status/health
        port: {{ .Values.application.druid.port }}
    readinessProbe:
      httpGet:
        path: /druid/historical/v1/readiness
        port: {{ .Values.application.druid.port }}
      periodSeconds: 10
      failureThreshold: 18
    env:
      - name: GOOGLE_APPLICATION_CREDENTIALS
        value: /var/secrets/google/key.json
      - name: DRUID_XMS
        value: 1500m
      - name: DRUID_XMX
        value: 1500m
    runtime.properties: |
      druid.service=druid/historical
      druid.server.http.numThreads=10
      druid.processing.buffer.sizeBytes=536870912
      druid.processing.numMergeBuffers=2
      druid.processing.numThreads=2
      druid.segmentCache.locations=[{"path":"/druid/data/segments","maxSize":40000000000}]
      druid.server.maxSize=40000000000
    extra.jvm.options: |-
      -Xmx512M
      -Xms512M
**middlemanagers:**

    druid.port: {{ .Values.application.druid.port }}
    kind: StatefulSet
    nodeType: middleManager
    nodeConfigMountPath: /opt/druid/conf/druid/cluster/data/middleManager
    env:
      - name: GOOGLE_APPLICATION_CREDENTIALS
        value: /var/secrets/google/key.json
      - name: DRUID_XMX
        value: 2048m
      - name: DRUID_XMS
        value: 2048m
    resources:
      requests:
        memory: 2.5Gi
      limits:
        memory: 3.4Gi
    podDisruptionBudgetSpec:
      maxUnavailable: 1
    replicas: {{ .Values.application.druid.middlemanagers.replicaCount }}
    livenessProbe:
      initialDelaySeconds: 60
      periodSeconds: 5
      failureThreshold: 3
      httpGet:
        path: /status/health
        port: {{ .Values.application.druid.port }}
    readinessProbe:
      initialDelaySeconds: 60
      periodSeconds: 5
      failureThreshold: 3
      httpGet:
        path: /status/health
        port: {{ .Values.application.druid.port }}
    runtime.properties: |
      druid.service=druid/middleManager
      druid.worker.capacity=6
      druid.indexer.runner.javaOpts=-server -Xmx3g -XX:+UseG1GC -XX:MaxGCPauseMillis=100 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager
      druid.indexer.task.restoreTasksOnRestart=true
      druid.indexer.task.baseTaskDir=var/druid/task
      druid.server.http.numThreads=8
      druid.indexer.fork.property.druid.processing.numMergeBuffers=2
      druid.indexer.fork.property.druid.processing.buffer.sizeBytes=104857600
      druid.indexer.fork.property.druid.processing.numThreads=2
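A hedged arithmetic check on this section (whether peons are actually being OOM-killed would need to be confirmed from `kubectl describe pod` or the task logs): with `druid.worker.capacity=6` and `-Xmx3g` per peon, a busy MiddleManager pod can need far more memory than its 3.4Gi limit, and even a single peon's heap plus direct memory is close to that limit:

```python
# Worst-case peon memory on one MiddleManager pod, from the settings above.
# Direct memory per peon estimated with Druid's documented formula:
#   sizeBytes * (numMergeBuffers + numThreads + 1)

GIB = 2**30
worker_capacity = 6                 # druid.worker.capacity
peon_heap = 3 * GIB                 # -Xmx3g in druid.indexer.runner.javaOpts
buffer = 104857600                  # fork.property ... buffer.sizeBytes (100 MiB)
peon_direct = buffer * (2 + 2 + 1)  # numMergeBuffers=2, numThreads=2, +1

worst_case = worker_capacity * (peon_heap + peon_direct)
pod_limit = int(3.4 * GIB)          # resources.limits.memory: 3.4Gi
print(worst_case / GIB, pod_limit / GIB)  # ~20.9 GiB needed vs 3.4 GiB limit
```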
**routers:**

    kind: StatefulSet
    nodeType: "router"
    druid.port: {{ .Values.application.druid.port }}
    nodeConfigMountPath: "/opt/druid/conf/druid/cluster/query/router"
    podDisruptionBudgetSpec:
      maxUnavailable: 1
    replicas: {{ .Values.application.druid.routers.replicaCount }}
    livenessProbe:
      initialDelaySeconds: 60
      periodSeconds: 5
      failureThreshold: 3
      httpGet:
        path: /status/health
        port: {{ .Values.application.druid.port }}
    readinessProbe:
      initialDelaySeconds: 60
      periodSeconds: 5
      failureThreshold: 3
      httpGet:
        path: /status/health
        port: {{ .Values.application.druid.port }}
    env:
      - name: GOOGLE_APPLICATION_CREDENTIALS
        value: key.json
      - name: DRUID_XMX
        value: 4096m
      - name: DRUID_XMS
        value: 4096m
    runtime.properties: |
      druid.service=druid/router
      druid.processing.numThreads=1
      # HTTP proxy
      druid.router.http.numConnections=50
      druid.router.http.readTimeout=PT5M
      druid.router.http.numMaxThreads=100
      druid.server.http.numThreads=100
      # Service discovery
      druid.router.defaultBrokerServiceName=druid/broker
      druid.router.coordinatorServiceName=druid/coordinator
      # Management proxy to coordinator / overlord: required for the unified web console.
      druid.router.managementProxy.enabled=true
    extra.jvm.options: |-
      -Xmx512M
      -Xms512M
[TaskError.txt](https://github.com/apache/druid/files/7213567/TaskError.txt)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]