MMirelli opened a new issue, #3751: URL: https://github.com/apache/bookkeeper/issues/3751
**BUG REPORT** ***Describe the bug*** Pulsar broker cannot start due to bookies timing out, when using [BookKeeper DS fork v4.14.6.1.0.0](https://github.com/datastax/bookkeeper/releases/tag/ds-4.14.6.1.0.0). The following error manifests on the broker pods. No other error / exception is logged by the bookies. After git-bisecting the issue against the DS fork, the error appears to have been introduced by https://github.com/apache/bookkeeper/pull/3528. ``` 2023-01-19T13:32:45,290+0000 [BookKeeperClientWorker-OrderedExecutor-3-0] ERROR org.apache.bookkeeper.client.PendingReadOp - Read of ledger entry failed: L3 E0-E0, Sent to [my-cluster-bookkeeper-7.my-cluster-bookkeeper.pulsar.svc.cluster.local:3181, my-cluster-bookkeeper-1.my-cluster-bookkeeper.pulsar.svc.cluster.local:3181], Heard from [] : bitset = {}, Error = 'Bookie operation timeout'. First unread entry is (-1, rc = null) 2023-01-19T13:32:45,290+0000 [BookKeeperClientWorker-OrderedExecutor-3-0] ERROR org.apache.pulsar.broker.service.schema.SchemaRegistryServiceImpl - [pulsar/my-cluster-broker-0.my-cluster-broker.pulsar.svc.cluster.local:8080/healthcheck] Put schema failed java.util.concurrent.CompletionException: org.apache.pulsar.broker.service.schema.exceptions.SchemaException: Bookie operation timeout - ledger=3 - operation=Failed to read entry - entry=0 at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:367) ~[?:?] at java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:376) ~[?:?] at java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:1019) ~[?:?] at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[?:?] at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088) ~[?:?] at org.apache.pulsar.broker.service.schema.SchemaRegistryServiceImpl.lambda$trimDeletedSchemaAndGetList$24(SchemaRegistryServiceImpl.java:463) ~[com.datastax.oss-pulsar-broker-2.10.3.0.jar:2.10.3.0] at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930) ~[?:?] at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907) ~[?:?] at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[?:?] at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088) ~[?:?] at org.apache.pulsar.broker.service.schema.BookkeeperSchemaStorage$Functions.lambda$getLedgerEntry$0(BookkeeperSchemaStorage.java:651) ~[com.datastax.oss-pulsar-broker-2.10.3.0.jar:2.10.3.0] at org.apache.bookkeeper.client.LedgerHandle$6.onFailure(LedgerHandle.java:825) ~[com.datastax.oss-bookkeeper-server-4.14.6.1.0.0.jar:4.14.6.1.0.0] at org.apache.bookkeeper.common.concurrent.FutureEventListener.accept(FutureEventListener.java:38) ~[com.datastax.oss-bookkeeper-common-4.14.6.1.0.0.jar:4.14.6.1.0.0] at org.apache.bookkeeper.common.concurrent.FutureEventListener.accept(FutureEventListener.java:26) ~[com.datastax.oss-bookkeeper-common-4.14.6.1.0.0.jar:4.14.6.1.0.0] at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) ~[?:?] at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837) ~[?:?] at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478) ~[?:?] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?] at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty-netty-common-4.1.86.Final.jar:4.1.86.Final] at java.lang.Thread.run(Thread.java:829) ~[?:?] Caused by: org.apache.pulsar.broker.service.schema.exceptions.SchemaException: Bookie operation timeout - ledger=3 - operation=Failed to read entry - entry=0 at org.apache.pulsar.broker.service.schema.BookkeeperSchemaStorage.bkException(BookkeeperSchemaStorage.java:707) ~[com.datastax.oss-pulsar-broker-2.10.3.0.jar:2.10.3.0] ... 11 more ``` ***To Reproduce*** Steps to reproduce the behavior: 1. Deploy pulsar via the [DataStax pulsar helmchart](https://github.com/datastax/pulsar-helm-chart), using the values: ``` image: brokerSts: repository: dsmmirelli/lunastreaming-all pullPolicy: IfNotPresent imagePullSecrets: - name: aws-ecr-bellburnell-cred tag: bk-cef988f1 bookkeeper: repository: dsmmirelli/lunastreaming-all pullPolicy: IfNotPresent imagePullSecrets: - name: aws-ecr-bellburnell-cred tag: bk-cef988f1 bookkeeper: component: bookkeeper replicaCount: 3 annotations: prometheus.io/scrape: "true" prometheus.io/port: "8000" tolerations: - key: tier effect: NoSchedule operator: Equal value: pulsar probe: port: 3181 initial: 10 period: 30 resources: requests: memory: 9Gi cpu: 1 configData: BOOKIE_MEM: "\"-Xms4500m -Xmx4500m -XX:MaxDirectMemorySize=4500m -Dio.netty.recycler.linkCapacity=1024 -XX:+ParallelRefProcEnabled -XX:+DisableExplicitGC -XX:+ExitOnOutOfMemoryError -XX:+PerfDisableSharedMem\"" BOOKIE_GC: "\"-XX:+UseG1GC -XX:MaxGCPauseMillis=10\"" BOOKIE_EXTRA_OPTS: "\"-Dpulsar.allocator.leak_detection=Simple -Dio.netty.leakDetectionLevel=simple\"" statsProviderClass: org.apache.bookkeeper.stats.prometheus.PrometheusMetricsProvider useHostNameAsBookieID: "true" majorCompactionInterval: "1200" minorCompactionInterval: "600" compactionRateByEntries: "5000" gcWaitTime: "300000" PULSAR_PREFIX_diskUsageLwmThreshold: "0.90" PULSAR_PREFIX_isForceGCAllowWhenNoSpace: "true" service: annotations: publishNotReadyAddresses: "true" ports: - name: server port: 3181 brokerSts: component: broker replicaCount: 1 annotations: prometheus.io/scrape: "true" prometheus.io/port: "8080" tolerations: - key: tier effect: NoSchedule operator: Equal value: pulsar functionsWorkerEnabled: no webSocketServiceEnabled: no ledger: defaultEnsembleSize: 2 defaultAckQuorum: 2 defaultWriteQuorum: 2 probe: port: 8443 initial: 10 period: 30 resources: requests: memory: 10Gi cpu: 5 # authenticationProviders: "com.datastax.oss.pulsar.auth.AuthenticationProviderOpenID" configData: PULSAR_MEM: "\"-Xms5g -Xmx5g -XX:MaxDirectMemorySize=5g -Dio.netty.recycler.linkCapacity=1024 -XX:+ParallelRefProcEnabled -XX:+DisableExplicitGC -XX:+ExitOnOutOfMemoryError -XX:+PerfDisableSharedMem -Dpulsar.allocator.leak_detection=Simple -Dio.netty.leakDetectionLevel=simple\"" PULSAR_GC: "\"-XX:+UseG1GC -XX:MaxGCPauseMillis=10\"" exposeTopicLevelMetricsInPrometheus: "true" brokerDeleteInactiveTopicsEnabled: "false" brokerDeleteInactivePartitionedTopicMetadataEnabled: "false" forceDeleteNamespaceAllowed: "true" topicPublisherThrottlingTickTimeMillis: "10" brokerPublisherThrottlingTickTimeMillis: "50" managedLedgerMinLedgerRolloverTimeMinutes: "15" managedLedgerMaxLedgerRolloverTimeMinutes: "240" managedLedgerMaxSizePerLedgerMbytes: "2048" backlogQuotaDefaultRetentionPolicy: "producer_exception" dispatcherMaxReadBatchSize: "1000" allowAutoTopicCreation: "true" exposeConsumerLevelMetricsInPrometheus: "false" subscriptionExpirationTimeMinutes: "20160" topicFencingTimeoutSeconds: "5" replicationConnectionsPerBroker: "1" forceDeleteTenantAllowed: "true" managedLedgerNumSchedulerThreads: "8" webSocketNumIoThreads: "8" managedLedgerNumWorkerThreads: "8" numWorkerThreadsForNonPersistentTopic: "8" webSocketConnectionsPerBroker: "8" zooKeeperCacheExpirySeconds: "300" zooKeeperOperationTimeoutSeconds: "30" zooKeeperSessionTimeoutMillis: "30000" zookeeperSessionExpiredPolicy: shutdown loadBalancerOverrideBrokerNicSpeedGbps: "128" # cache parameters tuning loadBalancerLoadSheddingStrategy: "org.apache.pulsar.broker.loadbalance.impl.OverloadShedder" managedLedgerCacheEvictionFrequency: '0.2' managedLedgerCacheEvictionTimeThresholdMillis: "10000" managedLedgerCacheSizeMB: '512' managedLedgerCursorBackloggedThreshold: '1000' managedLedgerMaxBacklogBetweenCursorsForCaching: '1000000' managedLedgerMaxReadsInFlightSizeInMB: '100' managedLedgerMinimumBacklogCursorsForCaching: '4' managedLedgerMinimumBacklogEntriesForCaching: '1' ``` Where `dsmmirelli/lunastreaming-all:bk-cef988f1` is the image with bk version containing https://github.com/apache/bookkeeper/pull/3528. 2. Observe the aforementioned error in the broker logs. ***Expected behavior*** The broker statefulset should start smoothly with BookKeeper ds-4.14.6.1.0.0. ***Screenshots*** If applicable, add screenshots to help explain your problem. ***Additional context*** 1. Changing the brokers / bookies' image to `dsmmirelli/lunastreaming-all:bk-dc855d22`, the broker starts smoothly. That is why we expect the regression to be in https://github.com/apache/bookkeeper/pull/3528. 2. Changing the docker image on the bookies to (not) contain the breaking change doesn't make a difference, so the issue seems to be on the BookKeeper client running on the pulsar broker. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
