ckdarby opened a new issue #7058:
URL: https://github.com/apache/pulsar/issues/7058


   **Describe the bug**
   Reading a backlog from the earliest offset tops out at 60-100 MB/s per partition, even though the bookies can read 200-300 MB/s directly off EBS, and no resource bottleneck is visible in the metrics.
   
   **To Reproduce**
   Steps to reproduce the behavior:
   1. We're using the Pulsar Helm chart on AWS EKS:
https://github.com/apache/pulsar-helm-chart/commit/6e9ad25ba322f6f0fc7c11c66fb88faa6d0218db
   2. Our values.yaml overrides look like this:
   
   ```yaml
   pulsar:
     namespace: cory-ebs-test
     components:
       pulsar_manager: false # UI is outdated and won't load without errors
     auth:
       authentication:
         enabled: true
     bookkeeper:
       resources:
         requests:
           memory: 11560Mi
           cpu: 1.5
       volumes:
         journal:
           size: 100Gi
         ledgers:
           size: 5Ti
       configData:
         # `BOOKIE_MEM` is used for `bookie shell`
         BOOKIE_MEM: >
           "
           -Xms1280m
           -Xmx10800m
           -XX:MaxDirectMemorySize=10800m
           "
         # we use `bin/pulsar` for starting bookie daemons
         PULSAR_MEM: >
           "
           -Xms10800m
           -Xmx10800m
           -XX:MaxDirectMemorySize=10800m
           "
         # configure the memory settings based on jvm memory settings
         dbStorage_writeCacheMaxSizeMb: "2500" #pulsar docs say 25%
         dbStorage_readAheadCacheMaxSizeMb: "2500" #pulsar docs say 25%
         dbStorage_rocksDB_writeBufferSizeMB: "64" #pulsar docs had 64
         dbStorage_rocksDB_blockCacheSize: "1073741824" #pulsar docs say 10%
         readBufferSizeBytes: "8096" #attempted doubling
     autorecovery:
       resources:
         requests:
           memory: 2048Mi
           cpu: 1
       configData:
         BOOKIE_MEM: >
           "
           -Xms1500m -Xmx1500m
           "
     broker:
       resources:
         requests:
           memory: 4096Mi
           cpu: 1
       configData:
         PULSAR_MEM: >
           "
           -Xms1024m -Xmx4096m -XX:MaxDirectMemorySize=4096m
           -Dio.netty.leakDetectionLevel=disabled
           -Dio.netty.recycler.linkCapacity=1024
           -XX:+ParallelRefProcEnabled
           -XX:+UnlockExperimentalVMOptions
           -XX:+DoEscapeAnalysis
           -XX:ParallelGCThreads=4
           -XX:ConcGCThreads=4
           -XX:G1NewSizePercent=50
           -XX:+DisableExplicitGC
           -XX:-ResizePLAB
           -XX:+ExitOnOutOfMemoryError
           -XX:+PerfDisableSharedMem
           "
     proxy:
       resources:
         requests:
           memory: 4096Mi
           cpu: 1
       configData:
         PULSAR_MEM: >
           "
           -Xms1024m -Xmx4096m -XX:MaxDirectMemorySize=4096m
           -Dio.netty.leakDetectionLevel=disabled
           -Dio.netty.recycler.linkCapacity=1024
           -XX:+ParallelRefProcEnabled
           -XX:+UnlockExperimentalVMOptions
           -XX:+DoEscapeAnalysis
           -XX:ParallelGCThreads=4
           -XX:ConcGCThreads=4
           -XX:G1NewSizePercent=50
           -XX:+DisableExplicitGC
           -XX:-ResizePLAB
           -XX:+ExitOnOutOfMemoryError
           -XX:+PerfDisableSharedMem
           "
       service:
         annotations:
           service.beta.kubernetes.io/aws-load-balancer-type: nlb
           external-dns.alpha.kubernetes.io/hostname: pulsar.internal.ckdarby
     toolset:
       resources:
         requests:
           memory: 1028Mi
           cpu: 1
       configData:
         PULSAR_MEM: >
           "
           -Xms640m
           -Xmx1028m
           -XX:MaxDirectMemorySize=1028m
           "
     grafana:
       service:
         annotations:
           external-dns.alpha.kubernetes.io/hostname: grafana.internal.ckdarby
       admin:
         user: admin
         password: 12345
   ```
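   The cache values above follow the percentages mentioned in the config comments (25% / 25% / 10%). A quick sanity check of that arithmetic, assuming the percentages are taken against `-XX:MaxDirectMemorySize=10800m`:

   ```python
   # Sanity check of the bookie cache sizing above, assuming (per the
   # config comments) 25% / 25% / 10% of -XX:MaxDirectMemorySize=10800m.
   direct_memory_mb = 10800

   write_cache_mb = int(direct_memory_mb * 0.25)       # dbStorage_writeCacheMaxSizeMb
   read_ahead_cache_mb = int(direct_memory_mb * 0.25)  # dbStorage_readAheadCacheMaxSizeMb
   block_cache_bytes = int(direct_memory_mb * 0.10) * 1024 * 1024  # dbStorage_rocksDB_blockCacheSize

   print(write_cache_mb)      # 2700 (values.yaml above uses 2500)
   print(read_ahead_cache_mb) # 2700 (values.yaml above uses 2500)
   print(block_cache_bytes)   # 1132462080 (values.yaml above uses 1073741824, i.e. 1 GiB)
   ```

   So the configured caches are slightly below the documented percentages, which should make them conservative rather than a cause of the slowdown.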
   
   3. Produce messages to a multi-partitioned topic:
   - Partitioned by 8
   - Average message size is ~1.5 KB
   - Retention set to 7 days
   - We're storing ~2-8 TB of retention at times
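   As a rough back-of-the-envelope check (assuming, hypothetically, that ingest is spread evenly over the 7-day window), the retention figures above imply this average produce rate across the topic:

   ```python
   # Average ingest implied by 2-8 TB retained over a 7-day window,
   # assuming (hypothetically) a steady produce rate.
   retention_seconds = 7 * 24 * 3600  # 7-day retention
   avg_msg_bytes = 1500               # ~1.5 KB average message

   for stored_tb in (2, 8):
       stored_bytes = stored_tb * 10**12
       mb_per_s = stored_bytes / retention_seconds / 10**6
       msgs_per_s = stored_bytes / retention_seconds / avg_msg_bytes
       # roughly 3-13 MB/s and 2,000-9,000 msg/s across all 8 partitions
       print(f"{stored_tb} TB retained -> ~{mb_per_s:.1f} MB/s, ~{msgs_per_s:,.0f} msg/s")
   ```

   In other words, the write load is modest; the problem shows up only on backlog reads.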
   
   4. Attempt to consume messages with the offset set to earliest (thus skipping any RocksDB read cache and going to the backlog). We have tried:
   - The Flink Pulsar connector
   - Pulsar's perf reader, run from the toolset pod against a single-partition topic
   
   ```json
   {
     "confFile" : "/pulsar/conf/client.conf",
     "topic" : [ "persistent://public/cory/test-ebs-partition-5" ],
     "numTopics" : 1,
     "rate" : 0.0,
     "startMessageId" : "earliest",
     "receiverQueueSize" : 1000,
     "maxConnections" : 100,
     "statsIntervalSeconds" : 0,
     "serviceURL" : "pulsar://cory-ebs-test-pulsar-proxy:6650/",
     "authPluginClassName" : 
"org.apache.pulsar.client.impl.auth.AuthenticationToken",
     "authParams" : "file:///pulsar/tokens/client/token",
     "useTls" : false,
     "tlsTrustCertsFilePath" : ""
   }
   ```
   
   
   5. Check Grafana, EBS graphs, etc.:
   - See really poor performance from Pulsar: 60-100 MB/s on the partition
   - Don't see any bottlenecks
   
   **Expected behavior**
   Pulsar is reading only 60-100 MB/s off each partition. We would expect something closer to the 200-300 MB/s that the bookie is actually able to read off EBS.
   
   
   **Additional context**
   Here is a real example with everything I could pull. The perf reader starts at 18:31:17 UTC and ends at 18:46:37 UTC; all the graphs cover that window and are in UTC.
   
   **Perf Reader Output**
   ```text
   18:31:17.389 [main] INFO  org.apache.pulsar.testclient.PerformanceReader - 
Read throughput: 58250.685  msg/s -- 647.672 Mbit/s
   18:31:27.389 [main] INFO  org.apache.pulsar.testclient.PerformanceReader - 
Read throughput: 58523.641  msg/s -- 667.659 Mbit/s
   18:31:37.390 [main] INFO  org.apache.pulsar.testclient.PerformanceReader - 
Read throughput: 61314.984  msg/s -- 688.519 Mbit/s
   18:31:47.390 [main] INFO  org.apache.pulsar.testclient.PerformanceReader - 
Read throughput: 64920.905  msg/s -- 748.406 Mbit/s
   18:31:57.390 [main] INFO  org.apache.pulsar.testclient.PerformanceReader - 
Read throughput: 64340.229  msg/s -- 732.601 Mbit/s
   ...
   18:42:17.416 [main] INFO  org.apache.pulsar.testclient.PerformanceReader - 
Read throughput: 64034.036  msg/s -- 723.160 Mbit/s
   18:42:27.419 [main] INFO  org.apache.pulsar.testclient.PerformanceReader - 
Read throughput: 63048.031  msg/s -- 700.458 Mbit/s
   18:42:37.421 [main] INFO  org.apache.pulsar.testclient.PerformanceReader - 
Read throughput: 69958.533  msg/s -- 817.095 Mbit/s
   18:42:47.422 [main] INFO  org.apache.pulsar.testclient.PerformanceReader - 
Read throughput: 69898.133  msg/s -- 827.770 Mbit/s
   18:42:57.422 [main] INFO  org.apache.pulsar.testclient.PerformanceReader - 
Read throughput: 62989.179  msg/s -- 726.990 Mbit/s
   18:43:07.422 [main] INFO  org.apache.pulsar.testclient.PerformanceReader - 
Read throughput: 63500.736  msg/s -- 728.683 Mbit/s
   ...
   18:45:37.430 [main] INFO  org.apache.pulsar.testclient.PerformanceReader - 
Read throughput: 55052.395  msg/s -- 645.263 Mbit/s
   18:45:47.431 [main] INFO  org.apache.pulsar.testclient.PerformanceReader - 
Read throughput: 72004.353  msg/s -- 804.856 Mbit/s
   18:45:57.431 [main] INFO  org.apache.pulsar.testclient.PerformanceReader - 
Read throughput: 86224.170  msg/s -- 954.399 Mbit/s
   18:46:07.431 [main] INFO  org.apache.pulsar.testclient.PerformanceReader - 
Read throughput: 80231.708  msg/s -- 905.096 Mbit/s
   18:46:17.432 [main] INFO  org.apache.pulsar.testclient.PerformanceReader - 
Read throughput: 73065.824  msg/s -- 864.556 Mbit/s
   ```
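   Converting the first log line above from Mbit/s to MB/s, and backing out the implied average message size, confirms the perf reader numbers are consistent with the 60-100 MB/s observation rather than a unit mix-up:

   ```python
   # Convert one perf-reader log line (58250.685 msg/s, 647.672 Mbit/s)
   # into MB/s and an implied average message size.
   msg_per_s = 58250.685
   mbit_per_s = 647.672

   mb_per_s = mbit_per_s / 8                     # ~81 MB/s
   avg_msg_bytes = mb_per_s * 10**6 / msg_per_s  # implied average payload

   print(f"~{mb_per_s:.0f} MB/s, ~{avg_msg_bytes:.0f} bytes/msg")  # ~81 MB/s, ~1390 bytes/msg
   ```

   The implied ~1.4 KB payload lines up with the ~1.5 KB average reported above, so the gap to the 200-300 MB/s EBS read rate is real.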
   
   
   
   **Bookie reading directly from EBS**
   The disk cache was flushed first, and this reading was taken before running the perf reader.
   
![Selection_292](https://user-images.githubusercontent.com/220283/83058097-d40bd200-a025-11ea-8f9e-bd058ba0468a.png)
   
   
   **EC2 instances**
   Amount: 13
   Type: r5.large
   AZ: All in us-west-2c
   All within Kubernetes
   
   
   **EBS**
   
![Selection_009](https://user-images.githubusercontent.com/220283/83074778-b9932200-a040-11ea-92b5-0499eb19c32a.png)
   
   **Grafana Overview**
   
![Selection_008](https://user-images.githubusercontent.com/220283/83074863-e34c4900-a040-11ea-921c-e1ece4471452.png)
   
   **JVM**
   
   Bookie
   
![Selection_002](https://user-images.githubusercontent.com/220283/83074952-070f8f00-a041-11ea-82b0-cc87438c7dd9.png)
   
   Broker
   
![Selection_003](https://user-images.githubusercontent.com/220283/83074964-0ecf3380-a041-11ea-8265-156c292671c7.png)
   
   Recovery
   
![Selection_004](https://user-images.githubusercontent.com/220283/83074971-155dab00-a041-11ea-88b6-c9b13411f91d.png)
   
   Zookeeper
   
![Selection_005](https://user-images.githubusercontent.com/220283/83074988-1b538c00-a041-11ea-9200-3df3b9d121bf.png)
   
   **Bookie**
   
![Selection_006](https://user-images.githubusercontent.com/220283/83075039-31614c80-a041-11ea-8865-8fe7a2bb7acd.png)
   
   
![Selection_007](https://user-images.githubusercontent.com/220283/83075048-358d6a00-a041-11ea-88ad-6b9cb5894d73.png)
   
   **Specifically public/cory/test-ebs-partition-5**
   
![Selection_001](https://user-images.githubusercontent.com/220283/83075150-5f469100-a041-11ea-9ee7-b078fb65b99d.png)
    
   
   

