EwanValentine opened a new issue #11303:
URL: https://github.com/apache/druid/issues/11303
I'm attempting to use S3 deep storage on EKS, but I just get a 403 error. I'm not in a position to use an access key/secret pair from our AWS account directly, but the nodes in our K8s cluster have service accounts. Attached to my Druid cluster's namespace is a role with full permissions on a specific bucket. However, when I attempt to load the sample dataset into Druid, I get an AWS 403 error in the logs.
A web identity token file is set in the environment variables, which anything built on the AWS SDK normally picks up automatically. I'm also explicitly passing in the region, etc.
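For reference, when IRSA (IAM Roles for Service Accounts) is wired up, EKS injects `AWS_WEB_IDENTITY_TOKEN_FILE` and `AWS_ROLE_ARN` into the pod environment, and it is those variables the SDK's default credential chain keys off. A small sketch for checking which of them a process actually sees (the helper name is illustrative; `AWS_REGION` is included because this config sets it explicitly):

```python
import os

# Environment variables EKS injects when IRSA is configured for the pod's
# service account, plus AWS_REGION, which this cluster spec sets explicitly.
IRSA_VARS = ("AWS_WEB_IDENTITY_TOKEN_FILE", "AWS_ROLE_ARN", "AWS_REGION")

def missing_irsa_vars(environ=None):
    """Return the IRSA-related variables absent from the given environment."""
    environ = os.environ if environ is None else environ
    return [name for name in IRSA_VARS if name not in environ]

# Example: a pod with no IRSA wiring at all reports every variable missing.
print(missing_irsa_vars({}))  # ['AWS_WEB_IDENTITY_TOKEN_FILE', 'AWS_ROLE_ARN', 'AWS_REGION']
```

Running this inside the peon/MiddleManager container (not just the router) can confirm whether the task JVMs inherit the variables.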
### Affected Version
`0.20, 0.21, 0.21.1-rc`
### Description
Please include as much detailed information about the problem as possible.
- Cluster size: two to three `m5.large` instances
- Configurations in use
```
apiVersion: druid.apache.org/v1alpha1
kind: Druid
metadata:
  name: ewanstenant
spec:
  commonConfigMountPath: /opt/druid/conf/druid/cluster/_common
  serviceAccount: "druid-scaling-spike"
  nodeSelector:
    service: ewanstenant-druid
  tolerations:
    - key: 'dedicated'
      operator: 'Equal'
      value: 'ewanstenant-druid'
      effect: 'NoSchedule'
  securityContext:
    fsGroup: 0
    runAsUser: 0
    runAsGroup: 0
  image: "apache/druid:0.21.1-rc1"
  startScript: /druid.sh
  jvm.options: |-
    -server
    -XX:+UseG1GC
    -Xloggc:gc-%t-%p.log
    -XX:+UseGCLogFileRotation
    -XX:GCLogFileSize=100M
    -XX:NumberOfGCLogFiles=10
    -XX:+HeapDumpOnOutOfMemoryError
    -XX:HeapDumpPath=/druid/data/logs
    -verbose:gc
    -XX:+PrintGCDetails
    -XX:+PrintGCTimeStamps
    -XX:+PrintGCDateStamps
    -XX:+PrintGCApplicationStoppedTime
    -XX:+PrintGCApplicationConcurrentTime
    -XX:+PrintAdaptiveSizePolicy
    -XX:+PrintReferenceGC
    -XX:+PrintFlagsFinal
    -Duser.timezone=UTC
    -Dfile.encoding=UTF-8
    -Djava.io.tmpdir=/druid/data
    -Daws.region=eu-west-1
    -Dorg.jboss.logging.provider=slf4j
    -Dlog4j.shutdownCallbackRegistry=org.apache.druid.common.config.Log4jShutdown
    -Dlog4j.shutdownHookEnabled=true
    -Dcom.sun.management.jmxremote.authenticate=false
    -Dcom.sun.management.jmxremote.ssl=false
  common.runtime.properties: |
    ###############################################
    # service names for coordinator and overlord
    ###############################################
    druid.selectors.indexing.serviceName=druid/overlord
    druid.selectors.coordinator.serviceName=druid/coordinator
    ##################################################
    # Request logging, monitoring, and segment
    ##################################################
    druid.request.logging.type=slf4j
    druid.request.logging.feed=requests
    ##################################################
    # Monitoring ( enable when using prometheus )
    ##################################################
    ################################################
    # Extensions
    ################################################
    druid.extensions.directory=/opt/druid/extensions
    druid.extensions.loadList=["druid-s3-extensions","postgresql-metadata-storage"]
    ####################################################
    # Enable sql
    ####################################################
    druid.sql.enable=true
    druid.storage.type=s3
    druid.storage.bucket=druid-scaling-spike-deepstore
    druid.storage.baseKey=druid/segments
    druid.indexer.logs.directory=data/logs/
    druid.storage.sse.type=s3
    druid.storage.disableAcl=false
    # druid.storage.type=local
    # druid.storage.storageDirectory=/druid/deepstorage
    druid.metadata.storage.type=derby
    druid.metadata.storage.connector.connectURI=jdbc:derby://localhost:1527/druid/data/derbydb/metadata.db;create=true
    druid.metadata.storage.connector.host=localhost
    druid.metadata.storage.connector.port=1527
    druid.metadata.storage.connector.createTables=true
    druid.zk.service.host=tiny-cluster-zk-0.tiny-cluster-zk
    druid.zk.paths.base=/druid
    druid.zk.service.compress=false
    druid.indexer.logs.type=file
    druid.indexer.logs.directory=/druid/data/indexing-logs
    druid.lookup.enableLookupSyncOnStartup=false
  volumeClaimTemplates:
    - metadata:
        name: deepstorage-volume
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 50Gi
        storageClassName: gp2
  volumeMounts:
    - mountPath: /druid/data
      name: data-volume
    - mountPath: /druid/deepstorage
      name: deepstorage-volume
  volumes:
    - name: data-volume
      emptyDir: {}
    - name: deepstorage-volume
      hostPath:
        path: /tmp/druid/deepstorage
        type: DirectoryOrCreate
  nodes:
    brokers:
      kind: Deployment
      druid.port: 8080
      nodeType: broker
      nodeConfigMountPath: "/opt/druid/conf/druid/cluster/query/broker"
      env:
        - name: DRUID_XMS
          value: 12000m
        - name: DRUID_XMX
          value: 12000m
        - name: DRUID_MAXDIRECTMEMORYSIZE
          value: 8g
        - name: AWS_REGION
          value: eu-west-1
      replicas: 1
      resources:
        limits:
          cpu: 1
          memory: 8Gi
        requests:
          cpu: 1
          memory: 8Gi
      readinessProbe:
        initialDelaySeconds: 60
        periodSeconds: 10
        failureThreshold: 30
        httpGet:
          path: /druid/broker/v1/readiness
          port: 8080
      runtime.properties: |
        druid.service=druid/broker
        druid.log4j2.sourceCategory=druid/broker
        druid.broker.http.numConnections=5
        # Processing threads and buffers
        druid.processing.buffer.sizeBytes=268435456
        druid.processing.numMergeBuffers=1
        druid.processing.numThreads=4
    coordinators:
      druid.port: 8080
      kind: Deployment
      maxSurge: 2
      maxUnavailable: 0
      nodeType: coordinator
      nodeConfigMountPath: "/opt/druid/conf/druid/cluster/master/coordinator-overlord"
      podDisruptionBudgetSpec:
        maxUnavailable: 1
      replicas: 1
      resources:
        limits:
          cpu: 1000m
          memory: 1Gi
        requests:
          cpu: 500m
          memory: 1Gi
      livenessProbe:
        initialDelaySeconds: 60
        periodSeconds: 5
        failureThreshold: 3
        httpGet:
          path: /status/health
          port: 8080
      readinessProbe:
        initialDelaySeconds: 60
        periodSeconds: 5
        failureThreshold: 3
        httpGet:
          path: /status/health
          port: 8080
      env:
        - name: DRUID_XMS
          value: 1g
        - name: DRUID_XMX
          value: 1g
        - name: AWS_REGION
          value: eu-west-1
      runtime.properties: |
        druid.service=druid/coordinator
        druid.log4j2.sourceCategory=druid/coordinator
        druid.indexer.runner.type=httpRemote
        druid.indexer.queue.startDelay=PT5S
        druid.coordinator.balancer.strategy=cachingCost
        druid.serverview.type=http
        druid.indexer.storage.type=metadata
        druid.coordinator.startDelay=PT10S
        druid.coordinator.period=PT5S
        druid.server.http.numThreads=5000
        druid.coordinator.asOverlord.enabled=true
        druid.coordinator.asOverlord.overlordService=druid/overlord
    historical:
      druid.port: 8080
      nodeType: historical
      nodeConfigMountPath: "/opt/druid/conf/druid/cluster/data/historical"
      replicas: 1
      livenessProbe:
        initialDelaySeconds: 1800
        periodSeconds: 5
        failureThreshold: 3
        httpGet:
          path: /status/health
          port: 8080
      readinessProbe:
        httpGet:
          path: /druid/historical/v1/readiness
          port: 8080
        periodSeconds: 10
        failureThreshold: 18
      resources:
        limits:
          cpu: 1000m
          memory: 12Gi
        requests:
          cpu: 1000m
          memory: 12Gi
      env:
        - name: DRUID_XMS
          value: 1500m
        - name: DRUID_XMX
          value: 1500m
        - name: DRUID_MAXDIRECTMEMORYSIZE
          value: 12g
        - name: AWS_REGION
          value: eu-west-1
      runtime.properties: |
        druid.service=druid/historical
        druid.log4j2.sourceCategory=druid/historical
        # HTTP server threads
        druid.server.http.numThreads=10
        # Processing threads and buffers
        druid.processing.buffer.sizeBytes=536870912
        druid.processing.numMergeBuffers=1
        druid.processing.numThreads=2
        # Segment storage
        druid.segmentCache.locations=[{\"path\":\"/opt/druid/data/historical/segments\",\"maxSize\":10737418240}]
        druid.server.maxSize=10737418240
        # Query cache
        druid.historical.cache.useCache=true
        druid.historical.cache.populateCache=true
        druid.cache.type=caffeine
        druid.cache.sizeInBytes=256000000
      volumeClaimTemplates:
        - metadata:
            name: historical-volume
          spec:
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: 50Gi
            storageClassName: gp2
      volumeMounts:
        - mountPath: /opt/druid/data/historical
          name: historical-volume
    middlemanagers:
      druid.port: 8080
      nodeType: middleManager
      nodeConfigMountPath: /opt/druid/conf/druid/cluster/data/middleManager
      env:
        - name: DRUID_XMX
          value: 4096m
        - name: DRUID_XMS
          value: 4096m
        - name: AWS_REGION
          value: eu-west-1
        - name: AWS_DEFAULT_REGION
          value: eu-west-1
      replicas: 1
      resources:
        limits:
          cpu: 1000m
          memory: 6Gi
        requests:
          cpu: 1000m
          memory: 6Gi
      livenessProbe:
        initialDelaySeconds: 60
        periodSeconds: 5
        failureThreshold: 3
        httpGet:
          path: /status/health
          port: 8080
      readinessProbe:
        initialDelaySeconds: 60
        periodSeconds: 5
        failureThreshold: 3
        httpGet:
          path: /status/health
          port: 8080
      runtime.properties: |
        druid.service=druid/middleManager
        druid.worker.capacity=3
        druid.indexer.task.baseTaskDir=/opt/druid/data/middlemanager/task
        druid.indexer.runner.javaOpts=-server -XX:MaxDirectMemorySize=10240g -Duser.timezone=UTC -Daws.region=eu-west-1 -Dfile.encoding=UTF-8 -Djava.io.tmpdir=/opt/druid/data/tmp -Dlog4j.debug -XX:+UnlockDiagnosticVMOptions -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=50 -XX:GCLogFileSize=10m -XX:+ExitOnOutOfMemoryError -XX:+HeapDumpOnOutOfMemoryError -XX:+UseG1GC -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager -XX:HeapDumpPath=/opt/druid/data/logs/peon.%t.%p.hprof -Xms10G -Xmx10G
        # HTTP server threads
        druid.server.http.numThreads=25
        # Processing threads and buffers on Peons
        druid.indexer.fork.property.druid.processing.numMergeBuffers=2
        druid.indexer.fork.property.druid.processing.buffer.sizeBytes=32000000
        druid.indexer.fork.property.druid.processing.numThreads=2
      volumeClaimTemplates:
        - metadata:
            name: middlemanagers-volume
          spec:
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: 50Gi
            storageClassName: gp2
      volumeMounts:
        - mountPath: /opt/druid/data/historical
          name: middlemanagers-volume
    routers:
      kind: Deployment
      nodeConfigMountPath: "/opt/druid/conf/druid/cluster/query/router"
      livenessProbe:
        initialDelaySeconds: 60
        periodSeconds: 5
        failureThreshold: 3
        httpGet:
          path: /status/health
          port: 8080
      readinessProbe:
        initialDelaySeconds: 60
        periodSeconds: 5
        failureThreshold: 3
        httpGet:
          path: /status/health
          port: 8080
      druid.port: 8080
      env:
        - name: AWS_REGION
          value: eu-west-1
        - name: AWS_DEFAULT_REGION
          value: eu-west-1
        - name: DRUID_XMX
          value: 1024m
        - name: DRUID_XMS
          value: 1024m
      resources:
        limits:
          cpu: 500m
          memory: 2Gi
        requests:
          cpu: 500m
          memory: 2Gi
      nodeType: router
      podDisruptionBudgetSpec:
        maxUnavailable: 1
      replicas: 1
      runtime.properties: |
        druid.service=druid/router
        druid.log4j2.sourceCategory=druid/router
        # HTTP proxy
        druid.router.http.numConnections=5000
        druid.router.http.readTimeout=PT5M
        druid.router.http.numMaxThreads=1000
        druid.server.http.numThreads=1000
        # Service discovery
        druid.router.defaultBrokerServiceName=druid/broker
        druid.router.coordinatorServiceName=druid/coordinator
        druid.router.managementProxy.enabled=true
      services:
        - metadata:
            name: router-%s-service
          spec:
            ports:
              - name: router-port
                port: 8080
            type: NodePort
```
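For context: the web identity token file only appears in the pods if the `druid-scaling-spike` service account referenced above carries the IRSA role annotation. A minimal sketch of what that object would look like (the role ARN is a placeholder, not taken from this cluster):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: druid-scaling-spike
  annotations:
    # Placeholder ARN: the IAM role granting access to the deep-storage bucket.
    eks.amazonaws.com/role-arn: arn:aws:iam::<account-id>:role/<druid-s3-role>
```

With this in place, the EKS pod identity webhook mounts the token and sets `AWS_WEB_IDENTITY_TOKEN_FILE`/`AWS_ROLE_ARN` in containers using the account.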
- Steps to reproduce the problem
  - Deploy the above to an EKS cluster using the latest operator version.
  - Expose the router port using `kubectl port-forward`:
    ```
    $ kubectl port-forward service/router-druid-ewanstenant-routers-service 12345:8080 -n <yourtenant>
    ```
  - Load the sample dataset using the default settings.
- The error message or stack traces encountered. Providing more context,
such as nearby log messages or even entire logs, can be helpful.
```
{"ingestionStatsAndErrors":{"taskId":"index_parallel_wikipedia_pedgollm_2021-05-25T23:51:09.811Z","payload":{"ingestionState":"BUILD_SEGMENTS","unparseableEvents":{},"rowStats":{"determinePartitions":{"processed":24433,"processedWithError":0,"thrownAway":0,"unparseable":0},"buildSegments":{"processed":24433,"processedWithError":0,"thrownAway":0,"unparseable":0}},"errorMsg":"java.lang.RuntimeException:
java.util.concurrent.ExecutionException: java.lang.RuntimeException:
java.io.IOException: com.amazonaws.services.s3.model.AmazonS3Exception: Access
Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request
ID: DJQGKG8Z57V4R2MP; S3 Extended Request ID:
IXmXtwpGLsf1mWTrU7sJLx/cM2Cg72GarKfbsAtpt763Wi62fft6odbo/jmQ2nZOJbS6hro0/QY=),
S3 Extended Request ID:
IXmXtwpGLsf1mWTrU7sJLx/cM2Cg72GarKfbsAtpt763Wi62fft6odbo/jmQ2nZOJbS6hro0/QY=\n\tat
org.apache.druid.indexing.common.task.IndexTask.generateAndPublishSegments(IndexTask.java:938)\n\tat
org.apache.druid.indexing.common.task.IndexTask.runTask(IndexTask.java:494)\n\tat
org.apache.druid.indexing.common.task.AbstractBatchIndexTask.run(AbstractBatchIndexTask.java:152)\n\tat
org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexSupervisorTask.runSequential(ParallelIndexSupervisorTask.java:964)\n\tat
org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexSupervisorTask.runTask(ParallelIndexSupervisorTask.java:445)\n\tat
org.apache.druid.indexing.common.task.AbstractBatchIndexTask.run(AbstractBatchIndexTask.java:152)\n\tat
org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:451)\n\tat
org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:423)\n\tat
java.util.concurrent.FutureTask.run(FutureTask.java:266)
```
- Any debugging that you have already done
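  One mechanical check on the task report above: the `errorMsg` payload embeds the AWS SDK's formatted S3 failure, so the status and error code can be extracted to confirm the failure is the S3 call itself rather than something downstream. A rough sketch (the regex is an assumption about the SDK's `(Service: ...; Status Code: ...; Error Code: ...)` message format seen in the log):

```python
import re

# The AWS Java SDK formats S3 failures as, e.g.:
#   "... (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; ...)"
S3_ERROR = re.compile(r"Status Code: (\d+); Error Code: (\w+)")

def classify_s3_failure(error_msg):
    """Return (status_code, error_code) if the message embeds an S3 failure, else None."""
    match = S3_ERROR.search(error_msg)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)

example = ("java.io.IOException: com.amazonaws.services.s3.model.AmazonS3Exception: "
           "Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; "
           "Request ID: DJQGKG8Z57V4R2MP)")
print(classify_s3_failure(example))  # (403, 'AccessDenied')
```

A 403/`AccessDenied` (as opposed to, say, 400/`InvalidToken`) points at the effective credentials or the bucket policy, not at a malformed token.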