Hi Tsz-Wo,

After turning on the paranoid leak-detection level, we reproduced the direct-memory OOM again. This time we found that the error stack is quite similar to the one in the recent grpc-java issue #9340 (https://gist.github.com/davidnadeau/4da26f072482fb58b19f4bd0379f35c7). They both share the same call path from RetriableStream to MessageFramer.
Therefore, the direct-memory OOM can be attributed to a bug in the gRPC retry mechanism. All we need to do is wait for the grpc-1.48.1 release and update the corresponding dependencies. Thanks again for the time you spent on this!

Regards,
William

> On Jul 21, 2022, at 13:13, Tsz Wo Sze <[email protected]> wrote:
>
> Hi William,
>
> According to this comment
> https://github.com/grpc/grpc-java/issues/9340#issuecomment-1185995690 ,
> they will have a fix in 1.48.1 soon.
>
> Tsz-Wo
>
>
> On Wed, Jul 20, 2022 at 7:43 PM William Song <[email protected]> wrote:
>
>> Hi Tsz-Wo,
>>
>> It indeed looks like the same problem. I’ll add
>> netty.leakDetectionLevel=paranoid to see if I can obtain more information.
>>
>> William
>>
>>> On Jul 21, 2022, at 01:13, Tsz Wo Sze <[email protected]> wrote:
>>>
>>> Hi William,
>>>
>>> Indeed, there is a recent gRPC "ByteBuffer memory leak in retry
>>> mechanism" issue; see https://github.com/grpc/grpc-java/issues/9340 .
>>> Not sure if it is the same problem you saw.
>>>
>>> Tsz-Wo
>>>
>>>
>>> On Tue, Jul 19, 2022 at 6:13 PM Tsz Wo Sze <[email protected]> wrote:
>>>
>>>> Hi William,
>>>>
>>>>> ... We use gRPC as their underlying communication channel. ...
>>>>
>>>> I searched the source code of IoTDB. IoTDB uses neither the Ratis
>>>> Streaming API nor anything in org.apache.ratis.thirdparty.io.netty.
>>>> Therefore, the leak seems to be from the gRPC library.
>>>>
>>>> Tsz-Wo
>>>>
>>>>
>>>> On Tue, Jul 19, 2022 at 1:22 AM William Song <[email protected]> wrote:
>>>>
>>>>> Hi Tsz-Wo,
>>>>>
>>>>> We set up a cluster of IoTDB DataNodes, which constitute a Raft group
>>>>> with 3 members, and have 3 clients writing data to these 3 servers
>>>>> respectively. We use gRPC as their underlying communication channel.
>>>>> After 48 hours of running, the 3 clients had written about 100 GB of
>>>>> data. Worth noting, 1 server is particularly slow and is about 2000
>>>>> log entries behind. On this slow server we discovered the
>>>>> direct-memory OOM error. This happens occasionally and is not
>>>>> deterministic.
>>>>>
>>>>> William
>>>>>
>>>>>
>>>>>> On Jul 19, 2022, at 00:51, Tsz Wo Sze <[email protected]> wrote:
>>>>>>
>>>>>> Hi William,
>>>>>>
>>>>>> It does look like a leak. Could you provide the steps for
>>>>>> reproducing it?
>>>>>>
>>>>>> Tsz-Wo
>>>>>>
>>>>>>
>>>>>> On Mon, Jul 18, 2022 at 8:41 AM William Song <[email protected]> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> We discovered an error log from
>>>>>> org.apache.ratis.thirdparty.io.netty.util.ResourceLeakDetector saying
>>>>>> ByteBuf.release() is not called before it’s garbage-collected. The
>>>>>> following is the error log screenshot. We encountered direct-memory
>>>>>> OOM several times when running Ratis for a long time, so we assume
>>>>>> this message may have something to do with the direct-memory OOM
>>>>>> problem.
>>>>>>
>>>>>> Could anyone please take a look and check whether there is a memory
>>>>>> leak? Thanks in advance!
>>>>>>
>>>>>> Best Wishes,
>>>>>> William
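P.S. For anyone reproducing this: the paranoid leak detection discussed above is enabled with a JVM system property. A sketch of both invocations; the second property name is an assumption that Ratis's shading also relocates the `io.netty` prefix inside Netty's property-name strings:

```shell
# Enable Netty's PARANOID buffer-leak detection. It samples every ByteBuf
# allocation and records release() calls, so the overhead is significant;
# use it only while debugging.

# Plain (unshaded) Netty:
java -Dio.netty.leakDetection.level=paranoid -jar your-server.jar

# Netty shaded into ratis-thirdparty (assumed relocated property name):
java -Dorg.apache.ratis.thirdparty.io.netty.leakDetection.level=paranoid \
     -jar your-server.jar
```

The older spelling `io.netty.leakDetectionLevel` (used earlier in this thread) is still accepted by Netty as a deprecated alias.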

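P.P.S. For projects that depend on grpc-java directly, picking up the fix is just a version bump once 1.48.1 is out. A hedged Maven sketch (the artifact shown is one common grpc-java module; adjust to whichever modules your build actually declares, and note that the copy shaded into ratis-thirdparty is only updated by a new ratis-thirdparty release):

```xml
<!-- Pin grpc-java to 1.48.1, which is expected to carry the fix for the
     retry-buffer leak tracked in grpc/grpc-java#9340.
     Illustrative snippet only. -->
<dependency>
  <groupId>io.grpc</groupId>
  <artifactId>grpc-netty-shaded</artifactId>
  <version>1.48.1</version>
</dependency>
```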