EwanValentine opened a new issue #11303:
URL: https://github.com/apache/druid/issues/11303
I'm attempting to use S3 deep storage on EKS, but I just get a 403 error. I'm not in a position to use an access key/secret pair from our AWS account directly, but the nodes in our K8s cluster have service accounts. Attached to my Druid cluster's namespace is a role with full permissions on a specific bucket. However, when I attempt to load the sample dataset into Druid, I get an AWS 403 error in the logs.
A web identity token file is set in the environment variables, which anything built on the AWS SDK normally picks up automatically. I'm also explicitly passing in the region, etc.
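For reference, when IRSA (IAM Roles for Service Accounts) is wired up, EKS injects `AWS_WEB_IDENTITY_TOKEN_FILE` and `AWS_ROLE_ARN` into the pod environment, and it is those variables the SDK's default credential chain keys off. A small sketch for checking which of them a process actually sees (the helper name is illustrative; `AWS_REGION` is included because this config sets it explicitly):

```python
import os

# Environment variables EKS injects when IRSA is configured for the pod's
# service account, plus AWS_REGION, which this cluster spec sets explicitly.
IRSA_VARS = ("AWS_WEB_IDENTITY_TOKEN_FILE", "AWS_ROLE_ARN", "AWS_REGION")

def missing_irsa_vars(environ=None):
    """Return the IRSA-related variables absent from the given environment."""
    environ = os.environ if environ is None else environ
    return [name for name in IRSA_VARS if name not in environ]

# Example: a pod with no IRSA wiring at all reports every variable missing.
print(missing_irsa_vars({}))  # ['AWS_WEB_IDENTITY_TOKEN_FILE', 'AWS_ROLE_ARN', 'AWS_REGION']
```

Running this inside the peon/MiddleManager container (not just the router) can confirm whether the task JVMs inherit the variables.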
### Affected Version
`0.20, 0.21, 0.21.1-rc`
### Description
Please include as much detailed information about the problem as possible.
- Cluster size: two to three `m5.large` instances
- Configurations in use
```
apiVersion: druid.apache.org/v1alpha1
kind: Druid
metadata:
  name: ewanstenant
spec:
  commonConfigMountPath: /opt/druid/conf/druid/cluster/_common
  serviceAccount: "druid-scaling-spike"
  nodeSelector:
    service: ewanstenant-druid
  tolerations:
    - key: 'dedicated'
      operator: 'Equal'
      value: 'ewanstenant-druid'
      effect: 'NoSchedule'
  securityContext:
    fsGroup: 0
    runAsUser: 0
    runAsGroup: 0
  image: "apache/druid:0.21.1-rc1"
  startScript: /druid.sh
  jvm.options: |-
    -server
    -XX:+UseG1GC
    -Xloggc:gc-%t-%p.log
    -XX:+UseGCLogFileRotation
    -XX:GCLogFileSize=100M
    -XX:NumberOfGCLogFiles=10
    -XX:+HeapDumpOnOutOfMemoryError
    -XX:HeapDumpPath=/druid/data/logs
    -verbose:gc
    -XX:+PrintGCDetails
    -XX:+PrintGCTimeStamps
    -XX:+PrintGCDateStamps
    -XX:+PrintGCApplicationStoppedTime
    -XX:+PrintGCApplicationConcurrentTime
    -XX:+PrintAdaptiveSizePolicy
    -XX:+PrintReferenceGC
    -XX:+PrintFlagsFinal
    -Duser.timezone=UTC
    -Dfile.encoding=UTF-8
    -Djava.io.tmpdir=/druid/data
    -Daws.region=eu-west-1
    -Dorg.jboss.logging.provider=slf4j
    -Dlog4j.shutdownCallbackRegistry=org.apache.druid.common.config.Log4jShutdown
    -Dlog4j.shutdownHookEnabled=true
    -Dcom.sun.management.jmxremote.authenticate=false
    -Dcom.sun.management.jmxremote.ssl=false
  common.runtime.properties: |
    ###############################################
    # service names for coordinator and overlord
    ###############################################
    druid.selectors.indexing.serviceName=druid/overlord
    druid.selectors.coordinator.serviceName=druid/coordinator
    ##################################################
    # Request logging, monitoring, and segment
    ##################################################
    druid.request.logging.type=slf4j
    druid.request.logging.feed=requests
    ##################################################
    # Monitoring ( enable when using prometheus )
    ##################################################
    ################################################
    # Extensions
    ################################################
    druid.extensions.directory=/opt/druid/extensions
    druid.extensions.loadList=["druid-s3-extensions","postgresql-metadata-storage"]
    ####################################################
    # Enable sql
    ####################################################
    druid.sql.enable=true
    druid.storage.type=s3
    druid.storage.bucket=druid-scaling-spike-deepstore
    druid.storage.baseKey=druid/segments
    druid.indexer.logs.directory=data/logs/
    druid.storage.sse.type=s3
    druid.storage.disableAcl=false
    # druid.storage.type=local
    # druid.storage.storageDirectory=/druid/deepstorage
    druid.metadata.storage.type=derby
    druid.metadata.storage.connector.connectURI=jdbc:derby://localhost:1527/druid/data/derbydb/metadata.db;create=true
    druid.metadata.storage.connector.host=localhost
    druid.metadata.storage.connector.port=1527
    druid.metadata.storage.connector.createTables=true
    druid.zk.service.host=tiny-cluster-zk-0.tiny-cluster-zk
    druid.zk.paths.base=/druid
    druid.zk.service.compress=false
    druid.indexer.logs.type=file
    druid.indexer.logs.directory=/druid/data/indexing-logs
    druid.lookup.enableLookupSyncOnStartup=false
  volumeClaimTemplates:
    - metadata:
        name: deepstorage-volume
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 50Gi
        storageClassName: gp2
  volumeMounts:
    - mountPath: /druid/data
      name: data-volume
    - mountPath: /druid/deepstorage
      name: deepstorage-volume
  volumes:
    - name: data-volume
      emptyDir: {}
    - name: deepstorage-volume
      hostPath:
        path: /tmp/druid/deepstorage
        type: DirectoryOrCreate
  nodes:
    brokers:
      kind: Deployment
      druid.port: 8080
      nodeType: broker
      nodeConfigMountPath: "/opt/druid/conf/druid/cluster/query/broker"
      env:
        - name: DRUID_XMS
          value: 12000m
        - name: DRUID_XMX
          value: 12000m
        - name: DRUID_MAXDIRECTMEMORYSIZE
          value: 8g
        - name: AWS_REGION
          value: eu-west-1
      replicas: 1
      resources:
        limits:
          cpu: 1
          memory: 8Gi
        requests:
          cpu: 1
          memory: 8Gi
      readinessProbe:
        initialDelaySeconds: 60
        periodSeconds: 10
        failureThreshold: 30
        httpGet:
          path: /druid/broker/v1/readiness
          port: 8080
      runtime.properties: |
        druid.service=druid/broker
        druid.log4j2.sourceCategory=druid/broker
        druid.broker.http.numConnections=5
        # Processing threads and buffers
        druid.processing.buffer.sizeBytes=268435456
        druid.processing.numMergeBuffers=1
        druid.processing.numThreads=4
    coordinators:
      druid.port: 8080
      kind: Deployment
      maxSurge: 2
      maxUnavailable: 0
      nodeType: coordinator
      nodeConfigMountPath: "/opt/druid/conf/druid/cluster/master/coordinator-overlord"
      podDisruptionBudgetSpec:
        maxUnavailable: 1
      replicas: 1
      resources:
        limits:
          cpu: 1000m
          memory: 1Gi
        requests:
          cpu: 500m
          memory: 1Gi
      livenessProbe:
        initialDelaySeconds: 60
        periodSeconds: 5
        failureThreshold: 3
        httpGet:
          path: /status/health
          port: 8080
      readinessProbe:
        initialDelaySeconds: 60
        periodSeconds: 5
        failureThreshold: 3
        httpGet:
          path: /status/health
          port: 8080
      env:
        - name: DRUID_XMS
          value: 1g
        - name: DRUID_XMX
          value: 1g
        - name: AWS_REGION
          value: eu-west-1
      runtime.properties: |
        druid.service=druid/coordinator
        druid.log4j2.sourceCategory=druid/coordinator
        druid.indexer.runner.type=httpRemote
        druid.indexer.queue.startDelay=PT5S
        druid.coordinator.balancer.strategy=cachingCost
        druid.serverview.type=http
        druid.indexer.storage.type=metadata
        druid.coordinator.startDelay=PT10S
        druid.coordinator.period=PT5S
        druid.server.http.numThreads=5000
        druid.coordinator.asOverlord.enabled=true
        druid.coordinator.asOverlord.overlordService=druid/overlord
    historical:
      druid.port: 8080
      nodeType: historical
      nodeConfigMountPath: "/opt/druid/conf/druid/cluster/data/historical"
      replicas: 1
      livenessProbe:
        initialDelaySeconds: 1800
        periodSeconds: 5
        failureThreshold: 3
        httpGet:
          path: /status/health
          port: 8080
      readinessProbe:
        httpGet:
          path: /druid/historical/v1/readiness
          port: 8080
        periodSeconds: 10
        failureThreshold: 18
      resources:
        limits:
          cpu: 1000m
          memory: 12Gi
        requests:
          cpu: 1000m
          memory: 12Gi
      env:
        - name: DRUID_XMS
          value: 1500m
        - name: DRUID_XMX
          value: 1500m
        - name: DRUID_MAXDIRECTMEMORYSIZE
          value: 12g
        - name: AWS_REGION
          value: eu-west-1
      runtime.properties: |
        druid.service=druid/historical
        druid.log4j2.sourceCategory=druid/historical
        # HTTP server threads
        druid.server.http.numThreads=10
        # Processing threads and buffers
        druid.processing.buffer.sizeBytes=536870912
        druid.processing.numMergeBuffers=1
        druid.processing.numThreads=2
        # Segment storage
        druid.segmentCache.locations=[{\"path\":\"/opt/druid/data/historical/segments\",\"maxSize\":10737418240}]
        druid.server.maxSize=10737418240
        # Query cache
        druid.historical.cache.useCache=true
        druid.historical.cache.populateCache=true
        druid.cache.type=caffeine
        druid.cache.sizeInBytes=256000000
      volumeClaimTemplates:
        - metadata:
            name: historical-volume
          spec:
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: 50Gi
            storageClassName: gp2
      volumeMounts:
        - mountPath: /opt/druid/data/historical
          name: historical-volume
    middlemanagers:
      druid.port: 8080
      nodeType: middleManager
      nodeConfigMountPath: /opt/druid/conf/druid/cluster/data/middleManager
      env:
        - name: DRUID_XMX
          value: 4096m
        - name: DRUID_XMS
          value: 4096m
        - name: AWS_REGION
          value: eu-west-1
        - name: AWS_DEFAULT_REGION
          value: eu-west-1
      replicas: 1
      resources:
        limits:
          cpu: 1000m
          memory: 6Gi
        requests:
          cpu: 1000m
          memory: 6Gi
      livenessProbe:
        initialDelaySeconds: 60
        periodSeconds: 5
        failureThreshold: 3
        httpGet:
          path: /status/health
          port: 8080
      readinessProbe:
        initialDelaySeconds: 60
        periodSeconds: 5
        failureThreshold: 3
        httpGet:
          path: /status/health
          port: 8080
      runtime.properties: |
        druid.service=druid/middleManager
        druid.worker.capacity=3
        druid.indexer.task.baseTaskDir=/opt/druid/data/middlemanager/task
        druid.indexer.runner.javaOpts=-server -XX:MaxDirectMemorySize=10240g -Duser.timezone=UTC -Daws.region=eu-west-1 -Dfile.encoding=UTF-8 -Djava.io.tmpdir=/opt/druid/data/tmp -Dlog4j.debug -XX:+UnlockDiagnosticVMOptions -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=50 -XX:GCLogFileSize=10m -XX:+ExitOnOutOfMemoryError -XX:+HeapDumpOnOutOfMemoryError -XX:+UseG1GC -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager -XX:HeapDumpPath=/opt/druid/data/logs/peon.%t.%p.hprof -Xms10G -Xmx10G
        # HTTP server threads
        druid.server.http.numThreads=25
        # Processing threads and buffers on Peons
        druid.indexer.fork.property.druid.processing.numMergeBuffers=2
        druid.indexer.fork.property.druid.processing.buffer.sizeBytes=32000000
        druid.indexer.fork.property.druid.processing.numThreads=2
      volumeClaimTemplates:
        - metadata:
            name: middlemanagers-volume
          spec:
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: 50Gi
            storageClassName: gp2
      volumeMounts:
        - mountPath: /opt/druid/data/historical
          name: middlemanagers-volume
    routers:
      kind: Deployment
      nodeConfigMountPath: "/opt/druid/conf/druid/cluster/query/router"
      livenessProbe:
        initialDelaySeconds: 60
        periodSeconds: 5
        failureThreshold: 3
        httpGet:
          path: /status/health
          port: 8080
      readinessProbe:
        initialDelaySeconds: 60
        periodSeconds: 5
        failureThreshold: 3
        httpGet:
          path: /status/health
          port: 8080
      druid.port: 8080
      env:
        - name: AWS_REGION
          value: eu-west-1
        - name: AWS_DEFAULT_REGION
          value: eu-west-1
        - name: DRUID_XMX
          value: 1024m
        - name: DRUID_XMS
          value: 1024m
      resources:
        limits:
          cpu: 500m
          memory: 2Gi
        requests:
          cpu: 500m
          memory: 2Gi
      nodeType: router
      podDisruptionBudgetSpec:
        maxUnavailable: 1
      replicas: 1
      runtime.properties: |
        druid.service=druid/router
        druid.log4j2.sourceCategory=druid/router
        # HTTP proxy
        druid.router.http.numConnections=5000
        druid.router.http.readTimeout=PT5M
        druid.router.http.numMaxThreads=1000
        druid.server.http.numThreads=1000
        # Service discovery
        druid.router.defaultBrokerServiceName=druid/broker
        druid.router.coordinatorServiceName=druid/coordinator
        druid.router.managementProxy.enabled=true
      services:
        - metadata:
            name: router-%s-service
          spec:
            ports:
              - name: router-port
                port: 8080
            type: NodePort
```
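For context: the web identity token file only appears in the pods if the `druid-scaling-spike` service account referenced above carries the IRSA role annotation. A minimal sketch of what that object would look like (the role ARN is a placeholder, not taken from this cluster):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: druid-scaling-spike
  annotations:
    # Placeholder ARN: the IAM role granting access to the deep-storage bucket.
    eks.amazonaws.com/role-arn: arn:aws:iam::<account-id>:role/<druid-s3-role>
```

With this in place, the EKS pod identity webhook mounts the token and sets `AWS_WEB_IDENTITY_TOKEN_FILE`/`AWS_ROLE_ARN` in containers using the account.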
- Steps to reproduce the problem
  - Deploy the above to an EKS cluster using the latest operator version.
  - Expose the router port using `kubectl port-forward`:
    ```
    $ kubectl port-forward service/router-druid-ewanstenant-routers-service 12345:8080 -n <yourtenant>
    ```
  - Load the sample dataset using the default settings.
- The error message or stack traces encountered. Providing more context,
such as nearby log messages or even entire logs, can be helpful.
```
{"ingestionStatsAndErrors":{"taskId":"index_parallel_wikipedia_pedgollm_2021-05-25T23:51:09.811Z","payload":{"ingestionState":"BUILD_SEGMENTS","unparseableEvents":{},"rowStats":{"determinePartitions":{"processed":24433,"processedWithError":0,"thrownAway":0,"unparseable":0},"buildSegments":{"processed":24433,"processedWithError":0,"thrownAway":0,"unparseable":0}},"errorMsg":"java.lang.RuntimeException:
java.util.concurrent.ExecutionException: java.lang.RuntimeException:
java.io.IOException: com.amazonaws.services.s3.model.AmazonS3Exception: Access
Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request
ID: DJQGKG8Z57V4R2MP; S3 Extended Request ID:
IXmXtwpGLsf1mWTrU7sJLx/cM2Cg72GarKfbsAtpt763Wi62fft6odbo/jmQ2nZOJbS6hro0/QY=),
S3 Extended Request ID:
IXmXtwpGLsf1mWTrU7sJLx/cM2Cg72GarKfbsAtpt763Wi62fft6odbo/jmQ2nZOJbS6hro0/QY=\n\tat
org.apache.druid.indexing.common.task.IndexTask.generateAndPublishSegments(IndexTask.java:938)\n\tat
org.apache.druid.indexing.common.task.IndexTask.runTask(IndexTask.java:494)\n\tat
org.apache.druid.indexing.common.task.AbstractBatchIndexTask.run(AbstractBatchIndexTask.java:152)\n\tat
org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexSupervisorTask.runSequential(ParallelIndexSupervisorTask.java:964)\n\tat
org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexSupervisorTask.runTask(ParallelIndexSupervisorTask.java:445)\n\tat
org.apache.druid.indexing.common.task.AbstractBatchIndexTask.run(AbstractBatchIndexTask.java:152)\n\tat
org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:451)\n\tat
org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:423)\n\tat
java.util.concurrent.FutureTask.run(FutureTask.java:266)
```
- Any debugging that you have already done
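  One mechanical check on the task report above: the `errorMsg` payload embeds the AWS SDK's formatted S3 failure, so the status and error code can be extracted to confirm the failure is the S3 call itself rather than something downstream. A rough sketch (the regex is an assumption about the SDK's `(Service: ...; Status Code: ...; Error Code: ...)` message format seen in the log):

```python
import re

# The AWS Java SDK formats S3 failures as, e.g.:
#   "... (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; ...)"
S3_ERROR = re.compile(r"Status Code: (\d+); Error Code: (\w+)")

def classify_s3_failure(error_msg):
    """Return (status_code, error_code) if the message embeds an S3 failure, else None."""
    match = S3_ERROR.search(error_msg)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)

example = ("java.io.IOException: com.amazonaws.services.s3.model.AmazonS3Exception: "
           "Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; "
           "Request ID: DJQGKG8Z57V4R2MP)")
print(classify_s3_failure(example))  # (403, 'AccessDenied')
```

A 403/`AccessDenied` (as opposed to, say, 400/`InvalidToken`) points at the effective credentials or the bucket policy, not at a malformed token.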