sohansamant8 opened a new issue #11734:
URL: https://github.com/apache/druid/issues/11734


   Hi,
   
   My Kafka ingestion tasks are failing without any error details. I submit a 
supervisor spec, and within less than 12 seconds the task goes from the running 
to the failed state. The supervisor keeps spawning new tasks, which also start 
failing. 
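   When a task dies without details in the web console, the Overlord's task API usually still has the status and full log. A minimal sketch, assuming the Router at port 8088 proxies to the Overlord (management proxy is enabled in the config below); the host and task ID are placeholders to substitute:
   
   ```shell
   # Placeholders: replace with your Router (or Overlord) address and a
   # failed task ID taken from the supervisor's task list in the console.
   ROUTER=http://localhost:8088
   TASK_ID=index_kafka_mydatasource_abc123
   
   # Status object, including any error message the Overlord recorded
   curl -s "$ROUTER/druid/indexer/v1/task/$TASK_ID/status"
   
   # Full task log -- the real failure (e.g. an OOM kill) often only shows here
   curl -s "$ROUTER/druid/indexer/v1/task/$TASK_ID/log" | tail -n 100
   ```
   
   If the log endpoint returns nothing at all, the peon process likely never got far enough to write a log, which points at the container being killed from outside (e.g. by Kubernetes).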
   
   I have two MiddleManager pods and two Historical nodes running. Three 
ingestion jobs currently run fine, but as soon as we add a fourth, its tasks 
start failing and the supervisor goes unhealthy. 
   
   Druid version: 0.21.0
   Deployed on Google Kubernetes Engine (GKE). 
   
   Any help on this is highly appreciated. 
   
   Attached are the config details:
   
       **brokers:**
         # Optionally specify for running broker as Deployment
         kind: StatefulSet
         nodeType: "broker"
         druid.port: {{ .Values.application.druid.port }}
         nodeConfigMountPath: "/opt/druid/conf/druid/cluster/query/broker"
         podDisruptionBudgetSpec:
           maxUnavailable: 1
         replicas: {{ .Values.application.druid.brokers.replicaCount }}
         readinessProbe:
           initialDelaySeconds: 60
           periodSeconds: 5
           failureThreshold: 3
           httpGet:
             path: /druid/broker/v1/readiness
             port: 8088
         env:
           - name: GOOGLE_APPLICATION_CREDENTIALS
             value: /var/secrets/google/key.json
           - name: DRUID_XMS
             value: 512m
           - name: DRUID_XMX
             value: 2048m
         runtime.properties: |
           druid.service=druid/broker
           druid.broker.http.numConnections=20
           druid.server.http.numThreads=5
           druid.processing.buffer.sizeBytes=268435456
           druid.processing.numMergeBuffers=2
           druid.processing.numThreads=1
           druid.sql.enable=true
         extra.jvm.options: |-
           -Xmx512M
           -Xms512M
   
       **coordinators:**
         # Optionally specify for running coordinator as Deployment
         kind: StatefulSet
         nodeType: "coordinator"
         druid.port: {{ .Values.application.druid.port }}
          nodeConfigMountPath: "/opt/druid/conf/druid/cluster/master/coordinator-overlord"
         podDisruptionBudgetSpec:
           maxUnavailable: 1
         replicas: {{ .Values.application.druid.coordinators.replicaCount }}
         livenessProbe:
           initialDelaySeconds: 60
           periodSeconds: 5
           failureThreshold: 3
           httpGet:
             path: /status/health
             port: {{ .Values.application.druid.port }}
         readinessProbe:
           initialDelaySeconds: 60
           periodSeconds: 5
           failureThreshold: 3
           httpGet:
             path: /status/health
             port: {{ .Values.application.druid.port }}
         env:
           - name: GOOGLE_APPLICATION_CREDENTIALS
             value: /var/secrets/google/key.json
           - name: DRUID_XMS
             value: 1g
           - name: DRUID_XMX
             value: 2048m
         runtime.properties: |
           druid.service=druid/coordinator
           druid.coordinator.startDelay=PT10S
           druid.coordinator.period=PT5S
           druid.coordinator.asOverlord.enabled=true
           druid.coordinator.asOverlord.overlordService=druid/overlord
           druid.indexer.queue.startDelay=PT30S
           druid.indexer.runner.type=remote
           druid.indexer.storage.type=metadata
           druid.indexer.runner.pendingTasksRunnerNumThreads=8
           druid.coordinator.maxNumConcurrentSubTasks=5
         extra.jvm.options: |-
           -Xmx512M
           -Xms512M
   
       **historicals:**
         kind: StatefulSet
         nodeType: "historical"
         druid.port: {{ .Values.application.druid.port }}
         nodeConfigMountPath: "/opt/druid/conf/druid/cluster/data/historical"
         podDisruptionBudgetSpec:
           maxUnavailable: 1
         replicas: {{ .Values.application.druid.historicals.replicaCount }}
         livenessProbe:
           initialDelaySeconds: 1800
           periodSeconds: 5
           failureThreshold: 3
           httpGet:
             path: /status/health
             port: {{ .Values.application.druid.port }}
         readinessProbe:
           httpGet:
             path: /druid/historical/v1/readiness
             port: {{ .Values.application.druid.port }}
           periodSeconds: 10
           failureThreshold: 18
         env:
           - name: GOOGLE_APPLICATION_CREDENTIALS
             value: /var/secrets/google/key.json
           - name: DRUID_XMS
             value: 1500m
           - name: DRUID_XMX
             value: 1500m
         runtime.properties: |
           druid.service=druid/historical
           druid.server.http.numThreads=10
           druid.processing.buffer.sizeBytes=536870912
           druid.processing.numMergeBuffers=2
            druid.processing.numThreads=2
            druid.segmentCache.locations=[{\"path\":\"/druid/data/segments\",\"maxSize\":40000000000}]
           druid.server.maxSize=40000000000
         extra.jvm.options: |-
           -Xmx512M
           -Xms512M
   
       **middlemanagers:**
         druid.port: {{ .Values.application.druid.port }}
         kind: StatefulSet
         nodeType: middleManager
         nodeConfigMountPath: /opt/druid/conf/druid/cluster/data/middleManager
         env:
           - name: GOOGLE_APPLICATION_CREDENTIALS
             value: /var/secrets/google/key.json
           - name: DRUID_XMX
             value: 2048m
           - name: DRUID_XMS
             value: 2048m
         resources:
           requests:
             memory: 2.5Gi
           limits:
             memory: 3.4Gi
         podDisruptionBudgetSpec:
           maxUnavailable: 1
         replicas: {{ .Values.application.druid.middlemanagers.replicaCount }}
         livenessProbe:
           initialDelaySeconds: 60
           periodSeconds: 5
           failureThreshold: 3
           httpGet:
             path: /status/health
             port: {{ .Values.application.druid.port }}
         readinessProbe:
           initialDelaySeconds: 60
           periodSeconds: 5
           failureThreshold: 3
           httpGet:
             path: /status/health
             port: {{ .Values.application.druid.port }}
         runtime.properties: |
             druid.service=druid/middleManager
             druid.worker.capacity=6
              druid.indexer.runner.javaOpts=-server -Xmx3g -XX:+UseG1GC -XX:MaxGCPauseMillis=100 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager
             druid.indexer.task.restoreTasksOnRestart=true
             druid.indexer.task.baseTaskDir=var/druid/task
             druid.server.http.numThreads=8
              druid.indexer.fork.property.druid.processing.numMergeBuffers=2
              druid.indexer.fork.property.druid.processing.buffer.sizeBytes=104857600
             druid.indexer.fork.property.druid.processing.numThreads=2
   
    **routers:**
         kind: StatefulSet
         nodeType: "router"
         druid.port: {{ .Values.application.druid.port }}
         nodeConfigMountPath: "/opt/druid/conf/druid/cluster/query/router"
         podDisruptionBudgetSpec:
           maxUnavailable: 1
         replicas: {{ .Values.application.druid.routers.replicaCount }}
         livenessProbe:
           initialDelaySeconds: 60
           periodSeconds: 5
           failureThreshold: 3
           httpGet:
             path: /status/health
             port: {{ .Values.application.druid.port }}
         readinessProbe:
           initialDelaySeconds: 60
           periodSeconds: 5
           failureThreshold: 3
           httpGet:
             path: /status/health
             port: {{ .Values.application.druid.port }}
         env:
           - name: GOOGLE_APPLICATION_CREDENTIALS
             value: key.json
           - name: DRUID_XMX
             value: 4096m
           - name: DRUID_XMS
             value: 4096m
         runtime.properties: |
           druid.service=druid/router
           druid.processing.numThreads=1
           # HTTP proxy
           druid.router.http.numConnections=50
           druid.router.http.readTimeout=PT5M
           druid.router.http.numMaxThreads=100
           druid.server.http.numThreads=100
           # Service discovery
           druid.router.defaultBrokerServiceName=druid/broker
           druid.router.coordinatorServiceName=druid/coordinator
            # Management proxy to coordinator / overlord: required for unified web console.
            druid.router.managementProxy.enabled=true
         extra.jvm.options: |-
           -Xmx512M
           -Xms512M
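   One thing worth checking, given the MiddleManager numbers above: with `druid.worker.capacity=6`, each peon launched with `-Xmx3g`, and roughly 500 MiB of direct memory per peon from the fork properties ((numThreads + numMergeBuffers + 1) × 100 MiB buffers), a fully loaded MiddleManager would need far more memory than its 3.4 Gi pod limit, so Kubernetes could OOM-kill the fourth job's peons with nothing in the task log. A back-of-envelope sketch (the direct-memory figure is an estimate, not a measured value):
   
   ```shell
   # Worst-case peon memory vs. the MiddleManager pod limit, using the
   # values from the config above. Direct memory per peon is estimated as
   # (numThreads + numMergeBuffers + 1) * buffer size.
   PEON_HEAP_MB=3072                        # -Xmx3g in druid.indexer.runner.javaOpts
   PEON_DIRECT_MB=$(( (2 + 2 + 1) * 100 ))  # fork props: 2 threads, 2 merge buffers, 100 MiB buffers
   CAPACITY=6                               # druid.worker.capacity
   POD_LIMIT_MB=3482                        # ~3.4Gi container memory limit
   
   NEEDED_MB=$(( CAPACITY * (PEON_HEAP_MB + PEON_DIRECT_MB) ))
   echo "worst-case peon memory: ${NEEDED_MB} MiB vs pod limit: ${POD_LIMIT_MB} MiB"
   ```
   
   If that arithmetic matches your setup, either lowering `druid.worker.capacity` and the peon `-Xmx`, or raising the pod memory limit, would be the usual fix; `kubectl describe pod` on a MiddleManager should confirm whether OOM kills are actually occurring.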
   
   [TaskError.txt](https://github.com/apache/druid/files/7213567/TaskError.txt)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


