MMirelli opened a new issue, #3751:
URL: https://github.com/apache/bookkeeper/issues/3751

   **BUG REPORT**
   
   ***Describe the bug***
   
   Pulsar broker cannot start due to bookies timing out, when using [BookKeeper 
DS fork 
v4.14.6.1.0.0](https://github.com/datastax/bookkeeper/releases/tag/ds-4.14.6.1.0.0).
 The following error manifests on the broker pods. No other error / exception 
is logged by the bookies. After git-bisecting the issue against the DS fork, 
the error appears to have been introduced by 
https://github.com/apache/bookkeeper/pull/3528.
   
   ```
   2023-01-19T13:32:45,290+0000 [BookKeeperClientWorker-OrderedExecutor-3-0] 
ERROR org.apache.bookkeeper.client.PendingReadOp - Read of ledger entry failed: 
L3 E0-E0, Sent to 
[my-cluster-bookkeeper-7.my-cluster-bookkeeper.pulsar.svc.cluster.local:3181, 
my-cluster-bookkeeper-1.my-cluster-bookkeeper.pulsar.svc.cluster.local:3181], 
Heard from [] : bitset = {}, Error = 'Bookie operation timeout'. First unread 
entry is (-1, rc = null)
   2023-01-19T13:32:45,290+0000 [BookKeeperClientWorker-OrderedExecutor-3-0] 
ERROR org.apache.pulsar.broker.service.schema.SchemaRegistryServiceImpl - 
[pulsar/my-cluster-broker-0.my-cluster-broker.pulsar.svc.cluster.local:8080/healthcheck]
 Put schema failed
   java.util.concurrent.CompletionException: 
org.apache.pulsar.broker.service.schema.exceptions.SchemaException: Bookie 
operation timeout -  ledger=3 - operation=Failed to read entry - entry=0
        at 
java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:367) 
~[?:?]
       at 
java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:376)
 ~[?:?]
        at 
java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:1019)
 ~[?:?]
        at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) 
~[?:?]
        at 
java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
 ~[?:?]
        at 
org.apache.pulsar.broker.service.schema.SchemaRegistryServiceImpl.lambda$trimDeletedSchemaAndGetList$24(SchemaRegistryServiceImpl.java:463)
 ~[com.datastax.oss-pulsar-broker-2.10.3.0.jar:2.10.3.0]
        at 
java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930) 
~[?:?]
        at 
java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907)
 ~[?:?]
        at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) 
~[?:?]
        at 
java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
 ~[?:?]
        at 
org.apache.pulsar.broker.service.schema.BookkeeperSchemaStorage$Functions.lambda$getLedgerEntry$0(BookkeeperSchemaStorage.java:651)
 ~[com.datastax.oss-pulsar-broker-2.10.3.0.jar:2.10.3.0]
        at 
org.apache.bookkeeper.client.LedgerHandle$6.onFailure(LedgerHandle.java:825) 
~[com.datastax.oss-bookkeeper-server-4.14.6.1.0.0.jar:4.14.6.1.0.0]
        at 
org.apache.bookkeeper.common.concurrent.FutureEventListener.accept(FutureEventListener.java:38)
 ~[com.datastax.oss-bookkeeper-common-4.14.6.1.0.0.jar:4.14.6.1.0.0]
        at 
org.apache.bookkeeper.common.concurrent.FutureEventListener.accept(FutureEventListener.java:26)
 ~[com.datastax.oss-bookkeeper-common-4.14.6.1.0.0.jar:4.14.6.1.0.0]
        at 
java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859)
 ~[?:?]
        at 
java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837)
 ~[?:?]
        at 
java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)
 ~[?:?]
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) 
~[?:?]
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) 
~[?:?]
        at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
 ~[io.netty-netty-common-4.1.86.Final.jar:4.1.86.Final]
        at java.lang.Thread.run(Thread.java:829) ~[?:?]
   Caused by: 
org.apache.pulsar.broker.service.schema.exceptions.SchemaException: Bookie 
operation timeout -  ledger=3 - operation=Failed to read entry - entry=0
        at 
org.apache.pulsar.broker.service.schema.BookkeeperSchemaStorage.bkException(BookkeeperSchemaStorage.java:707)
 ~[com.datastax.oss-pulsar-broker-2.10.3.0.jar:2.10.3.0]
        ... 11 more
   ```
   
   ***To Reproduce***
   
   Steps to reproduce the behavior:
   1. Deploy pulsar via the [DataStax pulsar 
helmchart](https://github.com/datastax/pulsar-helm-chart), using the values:
   ```
   image:
     brokerSts:
       repository: dsmmirelli/lunastreaming-all
       pullPolicy: IfNotPresent
       imagePullSecrets:
         - name: aws-ecr-bellburnell-cred
       tag: bk-cef988f1
     bookkeeper:
       repository: dsmmirelli/lunastreaming-all
       pullPolicy: IfNotPresent
       imagePullSecrets:
         - name: aws-ecr-bellburnell-cred
       tag: bk-cef988f1
   bookkeeper:
     component: bookkeeper
     replicaCount: 3
     annotations:
       prometheus.io/scrape: "true"
       prometheus.io/port: "8000"
     tolerations:
       - key: tier
         effect: NoSchedule
         operator: Equal
         value: pulsar
     probe:
       port: 3181
       initial: 10
       period: 30
     resources:
       requests:
         memory: 9Gi
         cpu: 1
     configData:
       BOOKIE_MEM: "\"-Xms4500m -Xmx4500m -XX:MaxDirectMemorySize=4500m 
-Dio.netty.recycler.linkCapacity=1024 -XX:+ParallelRefProcEnabled 
-XX:+DisableExplicitGC -XX:+ExitOnOutOfMemoryError -XX:+PerfDisableSharedMem\""
       BOOKIE_GC: "\"-XX:+UseG1GC -XX:MaxGCPauseMillis=10\""
       BOOKIE_EXTRA_OPTS: "\"-Dpulsar.allocator.leak_detection=Simple 
-Dio.netty.leakDetectionLevel=simple\""
       statsProviderClass: 
org.apache.bookkeeper.stats.prometheus.PrometheusMetricsProvider
       useHostNameAsBookieID: "true"
       majorCompactionInterval: "1200"
       minorCompactionInterval: "600"
       compactionRateByEntries: "5000"
       gcWaitTime: "300000"
       PULSAR_PREFIX_diskUsageLwmThreshold: "0.90"
       PULSAR_PREFIX_isForceGCAllowWhenNoSpace: "true"
     service:
       annotations:
         publishNotReadyAddresses: "true"
       ports:
         - name: server
           port: 3181
   brokerSts:
     component: broker
     replicaCount: 1
     annotations:
       prometheus.io/scrape: "true"
       prometheus.io/port: "8080"
     tolerations:
       - key: tier
         effect: NoSchedule
         operator: Equal
         value: pulsar
     functionsWorkerEnabled: no
     webSocketServiceEnabled: no
     ledger:
       defaultEnsembleSize: 2
       defaultAckQuorum: 2
       defaultWriteQuorum: 2
     probe:
       port: 8443
       initial: 10
       period: 30
     resources:
       requests:
         memory: 10Gi
         cpu: 5
         #  authenticationProviders: 
"com.datastax.oss.pulsar.auth.AuthenticationProviderOpenID"
     configData:
       PULSAR_MEM: "\"-Xms5g -Xmx5g -XX:MaxDirectMemorySize=5g 
-Dio.netty.recycler.linkCapacity=1024 -XX:+ParallelRefProcEnabled 
-XX:+DisableExplicitGC -XX:+ExitOnOutOfMemoryError -XX:+PerfDisableSharedMem 
-Dpulsar.allocator.leak_detection=Simple -Dio.netty.leakDetectionLevel=simple\""
       PULSAR_GC: "\"-XX:+UseG1GC -XX:MaxGCPauseMillis=10\""
       exposeTopicLevelMetricsInPrometheus: "true"
       brokerDeleteInactiveTopicsEnabled: "false"
       brokerDeleteInactivePartitionedTopicMetadataEnabled: "false"
       forceDeleteNamespaceAllowed: "true"
       topicPublisherThrottlingTickTimeMillis: "10"
       brokerPublisherThrottlingTickTimeMillis: "50"
       managedLedgerMinLedgerRolloverTimeMinutes: "15"
       managedLedgerMaxLedgerRolloverTimeMinutes: "240"
       managedLedgerMaxSizePerLedgerMbytes: "2048"
       backlogQuotaDefaultRetentionPolicy: "producer_exception"
       dispatcherMaxReadBatchSize: "1000"
       allowAutoTopicCreation: "true"
       exposeConsumerLevelMetricsInPrometheus: "false"
       subscriptionExpirationTimeMinutes: "20160"
       topicFencingTimeoutSeconds: "5"
       replicationConnectionsPerBroker: "1"
       forceDeleteTenantAllowed: "true"
       managedLedgerNumSchedulerThreads: "8"
       webSocketNumIoThreads: "8"
       managedLedgerNumWorkerThreads: "8"
       numWorkerThreadsForNonPersistentTopic: "8"
       webSocketConnectionsPerBroker: "8"
       zooKeeperCacheExpirySeconds: "300"
       zooKeeperOperationTimeoutSeconds: "30"
       zooKeeperSessionTimeoutMillis: "30000"
       zookeeperSessionExpiredPolicy: shutdown
       loadBalancerOverrideBrokerNicSpeedGbps: "128"
       # cache parameters tuning
       loadBalancerLoadSheddingStrategy: 
"org.apache.pulsar.broker.loadbalance.impl.OverloadShedder"
       managedLedgerCacheEvictionFrequency: '0.2'
       managedLedgerCacheEvictionTimeThresholdMillis: "10000"
       managedLedgerCacheSizeMB: '512'
       managedLedgerCursorBackloggedThreshold: '1000'
       managedLedgerMaxBacklogBetweenCursorsForCaching: '1000000'
       managedLedgerMaxReadsInFlightSizeInMB: '100'
       managedLedgerMinimumBacklogCursorsForCaching: '4'
       managedLedgerMinimumBacklogEntriesForCaching: '1'
   ```
   Where `dsmmirelli/lunastreaming-all:bk-cef988f1` is the image with bk 
version containing https://github.com/apache/bookkeeper/pull/3528. 
   2. Observe the aforementioned error in the broker logs.
   
   ***Expected behavior***
   
   The broker statefulset should start smoothly with BookKeeper 
ds-4.14.6.1.0.0. 
   
   ***Screenshots***
   
   If applicable, add screenshots to help explain your problem.
   
   ***Additional context***
   
   1. Changing the brokers / bookies' image to 
`dsmmirelli/lunastreaming-all:bk-dc855d22`, the broker starts smoothly. That is 
why we expect the regression to be in 
https://github.com/apache/bookkeeper/pull/3528.
   2. Changing the docker image on the bookies to (not) contain the breaking 
change doesn't make a difference, so the issue seems to be on the BookKeeper 
client running on the pulsar broker.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to