YanshuoH opened a new issue #12716:
URL: https://github.com/apache/pulsar/issues/12716


   Hi,
   
   **Describe the bug**
   
   We recently encountered a memory leak in the Pulsar broker. I've tried to identify 
the issue using the logs below.
   
   I raised `io.netty.leakDetectionLevel` to `paranoid` and set 
`io.netty.leakDetection.targetRecords=40`, but no leak messages 
appeared. 
   
   We have 4 pulsar clusters:
   
   - version 2.7.3
   - all running in Kubernetes 1.20; only one cluster has encountered this 
issue.
   
   The only difference in the affected cluster is that I had enabled 
geo-replication on all namespaces. I disabled it, but the memory leak remains.
   
   ```
   14:44:01.497 [pulsar-io-23-2] ERROR org.apache.pulsar.PulsarBrokerStarter - 
-- Shutting down - Received OOM exception: failed to allocate 16777216 byte(s) 
of direct memory (used: 1073741824, max: 1073741824)
   io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 16777216 
byte(s) of direct memory (used: 1073741824, max: 1073741824)
        at 
io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:776)
 ~[io.netty-netty-common-4.1.60.Final.jar:4.1.60.Final]
        at 
io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:731)
 ~[io.netty-netty-common-4.1.60.Final.jar:4.1.60.Final]
        at 
io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:645) 
~[io.netty-netty-buffer-4.1.60.Final.jar:4.1.60.Final]
        at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:621) 
~[io.netty-netty-buffer-4.1.60.Final.jar:4.1.60.Final]
        at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:204) 
~[io.netty-netty-buffer-4.1.60.Final.jar:4.1.60.Final]
        at io.netty.buffer.PoolArena.tcacheAllocateNormal(PoolArena.java:188) 
~[io.netty-netty-buffer-4.1.60.Final.jar:4.1.60.Final]
        at io.netty.buffer.PoolArena.allocate(PoolArena.java:138) 
~[io.netty-netty-buffer-4.1.60.Final.jar:4.1.60.Final]
        at io.netty.buffer.PoolArena.allocate(PoolArena.java:128) 
~[io.netty-netty-buffer-4.1.60.Final.jar:4.1.60.Final]
        at 
io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:378)
 ~[io.netty-netty-buffer-4.1.60.Final.jar:4.1.60.Final]
        at 
io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:187)
 ~[io.netty-netty-buffer-4.1.60.Final.jar:4.1.60.Final]
        at 
org.apache.bookkeeper.common.allocator.impl.ByteBufAllocatorImpl.newDirectBuffer(ByteBufAllocatorImpl.java:164)
 [org.apache.bookkeeper-bookkeeper-common-allocator-4.12.0.jar:4.12.0]
        at 
org.apache.bookkeeper.common.allocator.impl.ByteBufAllocatorImpl.newDirectBuffer(ByteBufAllocatorImpl.java:158)
 [org.apache.bookkeeper-bookkeeper-common-allocator-4.12.0.jar:4.12.0]
        at 
io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:187)
 [io.netty-netty-buffer-4.1.60.Final.jar:4.1.60.Final]
        at 
io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:178)
 [io.netty-netty-buffer-4.1.60.Final.jar:4.1.60.Final]
        at 
io.netty.channel.unix.PreferredDirectByteBufAllocator.ioBuffer(PreferredDirectByteBufAllocator.java:53)
 
[io.netty-netty-transport-native-unix-common-4.1.60.Final-linux-x86_64.jar:4.1.60.Final]
        at 
io.netty.channel.DefaultMaxMessagesRecvByteBufAllocator$MaxMessageHandle.allocate(DefaultMaxMessagesRecvByteBufAllocator.java:114)
 [io.netty-netty-transport-4.1.60.Final.jar:4.1.60.Final]
        at 
io.netty.channel.epoll.EpollRecvByteAllocatorHandle.allocate(EpollRecvByteAllocatorHandle.java:75)
 
[io.netty-netty-transport-native-epoll-4.1.60.Final-linux-x86_64.jar:4.1.60.Final]
        at 
io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:780)
 
[io.netty-netty-transport-native-epoll-4.1.60.Final-linux-x86_64.jar:4.1.60.Final]
        at 
io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe$1.run(AbstractEpollChannel.java:425)
 
[io.netty-netty-transport-native-epoll-4.1.60.Final-linux-x86_64.jar:4.1.60.Final]
        at 
io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
 [io.netty-netty-common-4.1.60.Final.jar:4.1.60.Final]
        at 
io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
 [io.netty-netty-common-4.1.60.Final.jar:4.1.60.Final]
        at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384) 
[io.netty-netty-transport-native-epoll-4.1.60.Final-linux-x86_64.jar:4.1.60.Final]
        at 
io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
 [io.netty-netty-common-4.1.60.Final.jar:4.1.60.Final]
        at 
io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) 
[io.netty-netty-common-4.1.60.Final.jar:4.1.60.Final]
        at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
 [io.netty-netty-common-4.1.60.Final.jar:4.1.60.Final]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_302]
   ```
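   To put the numbers in perspective (my own arithmetic, not from the log): the 16777216-byte request matches Netty's default pool chunk size of 16 MiB, and the 1073741824-byte limit is the broker's 1 GiB of max direct memory, so the pool is exhausted once 64 chunks are outstanding:
   
   ```java
   // Quick sanity check of the figures in the OOM message above.
   public class OomMath {
       public static void main(String[] args) {
           long max = 1073741824L;   // "max: 1073741824" from the log
           long request = 16777216L; // "failed to allocate 16777216 byte(s)"
           System.out.println(max / (1024 * 1024 * 1024)); // limit in GiB
           System.out.println(request / (1024 * 1024));    // request in MiB
           System.out.println(max / request);              // chunks until exhaustion
       }
   }
   ```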
   
   **Screenshots**
   
   This is only a test cluster with limited memory configured, but I believe 
the screenshots show the leak.
   
   <img width="866" alt="Screen Shot 2021-11-10 at 9 35 54 AM" src="https://user-images.githubusercontent.com/6973092/141083424-9976b698-2f01-46cc-8afa-b2467b4e30f8.png">
   
   <img width="848" alt="Screen Shot 2021-11-10 at 9 36 14 AM" src="https://user-images.githubusercontent.com/6973092/141083464-0eda851b-cc76-4863-b24d-4b1a70634bf5.png">
   
   <img width="864" alt="Screen Shot 2021-11-10 at 9 35 41 AM" src="https://user-images.githubusercontent.com/6973092/141083380-a81eafef-9f4a-4021-b0c2-4414bd978f18.png">
   
   <img width="861" alt="Screen Shot 2021-11-10 at 9 37 34 AM" src="https://user-images.githubusercontent.com/6973092/141083479-aaf6399f-e97c-4c95-b042-6818bd93115f.png">
   
   **Additional context**
   
   A wild guess: the leak may be related to data loss on the bookie side.
   I've found error logs like these there:
   
   ```
   11:51:51.568 [BookieReadThreadPool-OrderedExecutor-3-0] INFO  
org.apache.bookkeeper.proto.ReadEntryProcessorV3 - No ledger found while 
reading entry: 0 from ledger: 242603
   11:51:52.828 [bookie-io-1-1] INFO  
org.apache.bookkeeper.proto.BookieRequestHandler - Channels disconnected: [id: 
0x5d119911, L:/10.119.139.211:3181 ! R:/10.119.131.152:34376]
   11:52:10.560 [bookie-io-1-2] INFO  org.apache.bookkeeper.proto.AuthHandler - 
Authentication success on server side
   11:52:10.560 [bookie-io-1-2] INFO  
org.apache.bookkeeper.proto.BookieRequestHandler - Channel connected  [id: 
0xb66b4346, L:/10.119.139.211:3181 - R:/10.119.164.211:54744]
   11:52:11.149 [BookieReadThreadPool-OrderedExecutor-0-0] ERROR 
org.apache.bookkeeper.proto.ReadLacProcessorV3 - No ledger found while 
performing readLac from ledger: 242603
   org.apache.bookkeeper.bookie.Bookie$NoLedgerException: Ledger 242603 not 
found
        at 
org.apache.bookkeeper.bookie.LedgerDescriptor.createReadOnly(LedgerDescriptor.java:52)
 ~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
        at 
org.apache.bookkeeper.bookie.HandleFactoryImpl.getReadOnlyHandle(HandleFactoryImpl.java:61)
 ~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
        at org.apache.bookkeeper.bookie.Bookie.getExplicitLac(Bookie.java:1364) 
~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
        at 
org.apache.bookkeeper.proto.ReadLacProcessorV3.getReadLacResponse(ReadLacProcessorV3.java:71)
 [org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
        at 
org.apache.bookkeeper.proto.ReadLacProcessorV3.safeRun(ReadLacProcessorV3.java:118)
 [org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
        at 
org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36) 
[org.apache.bookkeeper-bookkeeper-common-4.12.0.jar:4.12.0]
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
[?:1.8.0_302]
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
[?:1.8.0_302]
        at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
 [io.netty-netty-common-4.1.60.Final.jar:4.1.60.Final]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_302]
   11:52:11.150 [BookieReadThreadPool-OrderedExecutor-0-0] ERROR 
org.apache.bookkeeper.proto.ReadLacProcessorV3 - No ledger found while trying 
to read last entry: 242603
   org.apache.bookkeeper.bookie.Bookie$NoLedgerException: Ledger 242603 not 
found
        at 
org.apache.bookkeeper.bookie.LedgerDescriptor.createReadOnly(LedgerDescriptor.java:52)
 ~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
        at 
org.apache.bookkeeper.bookie.HandleFactoryImpl.getReadOnlyHandle(HandleFactoryImpl.java:61)
 ~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
        at org.apache.bookkeeper.bookie.Bookie.readEntry(Bookie.java:1441) 
~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
        at 
org.apache.bookkeeper.proto.ReadLacProcessorV3.getReadLacResponse(ReadLacProcessorV3.java:86)
 [org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
        at 
org.apache.bookkeeper.proto.ReadLacProcessorV3.safeRun(ReadLacProcessorV3.java:118)
 [org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
        at 
org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36) 
[org.apache.bookkeeper-bookkeeper-common-4.12.0.jar:4.12.0]
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
[?:1.8.0_302]
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
[?:1.8.0_302]
        at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
 [io.netty-netty-common-4.1.60.Final.jar:4.1.60.Final]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_302]
   11:52:11.163 [BookieReadThreadPool-OrderedExecutor-3-0] INFO  
org.apache.bookkeeper.proto.ReadEntryProcessorV3 - No ledger found while 
reading entry: 0 from ledger: 242603
   ```
   
   This is because we hit a Kubernetes node issue a long time ago, and some 
ledgers were unintentionally deleted.
   
   I can confirm that all the bookies are working fine with no memory leak 
(aside from the exceptions above).
   
   Please let me know if there's any clue, or anything I can do to diagnose the 
memory leak.
   
   Thanks.




Reply via email to