Hi William,

Thanks for confirming the error stack. Hope the new gRPC release comes out soon.
Tsz-Wo

On Sun, Jul 24, 2022 at 6:12 AM William Song <[email protected]> wrote:
> Hi Tsz-Wo,
>
> After turning on the paranoid detection level, we reproduced the direct OOM
> again. This time we found that the error stack is quite similar to the
> recent grpc-java issue #9340
> (https://gist.github.com/davidnadeau/4da26f072482fb58b19f4bd0379f35c7).
> They both share the same call path from RetriableStream to MessageFramer.
>
> Therefore, the direct OOM problem can be attributed to a bug in the gRPC
> retry mechanism. All we need to do is wait for the grpc-1.48.1 release and
> update the corresponding dependencies. Thanks again for the time you spent
> on this!
>
> Regards,
> William
>
> > On Jul 21, 2022, at 13:13, Tsz Wo Sze <[email protected]> wrote:
> >
> > Hi William,
> >
> > According to this comment
> > https://github.com/grpc/grpc-java/issues/9340#issuecomment-1185995690 ,
> > they will have a fix in 1.48.1 soon.
> >
> > Tsz-Wo
> >
> >
> > On Wed, Jul 20, 2022 at 7:43 PM William Song <[email protected]> wrote:
> >
> >> Hi Tsz-Wo,
> >>
> >> It indeed looks like the same problem. I’ll add
> >> netty.leakDetectionLevel=paranoid to see if I can obtain more
> >> information.
> >>
> >> William
> >>
> >>> On Jul 21, 2022, at 01:13, Tsz Wo Sze <[email protected]> wrote:
> >>>
> >>> Hi William,
> >>>
> >>> Indeed, there is a recent gRPC "ByteBuffer memory leak in retry
> >>> mechanism" issue; see https://github.com/grpc/grpc-java/issues/9340 .
> >>> Not sure if it is the same problem you saw.
> >>>
> >>> Tsz-Wo
> >>>
> >>>
> >>> On Tue, Jul 19, 2022 at 6:13 PM Tsz Wo Sze <[email protected]> wrote:
> >>>
> >>>> Hi William,
> >>>>
> >>>>> ... We use gRPC as their underlying communication channel. ...
> >>>>
> >>>> I searched the source code of IoTDB. IoTDB uses neither the Ratis
> >>>> Streaming API nor anything in org.apache.ratis.thirdparty.io.netty.
> >>>> Therefore, the leak seems to be from the gRPC library.
> >>>>
> >>>> Tsz-Wo
> >>>>
> >>>>
> >>>> On Tue, Jul 19, 2022 at 1:22 AM William Song <[email protected]> wrote:
> >>>>
> >>>>> Hi Tsz-Wo,
> >>>>>
> >>>>> We set up a cluster of IoTDB DataNodes, which constitute a Raft group
> >>>>> with 3 members, and have 3 clients writing data to these 3 servers
> >>>>> respectively. We use gRPC as their underlying communication channel.
> >>>>> After 48 hours of running, the 3 clients wrote about 100 GB of data.
> >>>>> Worth noting, one server is particularly slow and is about 2000 log
> >>>>> entries behind. On this slow server we discovered the direct memory
> >>>>> OOM error. It happens occasionally and is not deterministic.
> >>>>>
> >>>>> William
> >>>>>
> >>>>>> On Jul 19, 2022, at 00:51, Tsz Wo Sze <[email protected]> wrote:
> >>>>>>
> >>>>>> Hi William,
> >>>>>>
> >>>>>> It does look like a leak. Could you provide the steps for
> >>>>>> reproducing it?
> >>>>>>
> >>>>>> Tsz-Wo
> >>>>>>
> >>>>>> On Mon, Jul 18, 2022 at 8:41 AM William Song <[email protected]> wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> We discovered an error log from
> >>>>>> org.apache.ratis.thirdparty.io.netty.util.ResourceLeakDetector saying
> >>>>>> ByteBuf.release() is not called before it’s garbage-collected. The
> >>>>>> following is the error log screenshot. We encountered direct memory
> >>>>>> OOM several times when running Ratis for a long time, so we assume
> >>>>>> this message may have something to do with the direct memory OOM
> >>>>>> problem.
> >>>>>>
> >>>>>> Could anyone please take a look and check whether there is a memory
> >>>>>> leak? Thanks in advance!
> >>>>>>
> >>>>>> Best Wishes,
> >>>>>> William
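[Editor's note, for readers following this thread in the archives: the mechanism under discussion is Netty-style reference counting. Every ByteBuf starts with a reference count of 1, and a retain() without a matching release() keeps the count above zero forever, so the direct buffer is never freed; the PARANOID detection level reports such buffers when they are garbage-collected while still retained, which is exactly the ResourceLeakDetector message quoted above. A minimal plain-Java sketch of that contract (illustrative only, not the real io.netty.buffer.ByteBuf API):]

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative stand-in for a reference-counted direct buffer.
// Not the real Netty API; it only models the retain()/release() contract.
final class RefCountedBuffer {
    private final AtomicInteger refCnt = new AtomicInteger(1); // allocated => count 1
    private boolean freed = false;

    // Adds a reference, e.g. when a retry path keeps a copy of the message.
    RefCountedBuffer retain() {
        refCnt.incrementAndGet();
        return this;
    }

    // Drops a reference; frees the buffer when the count reaches zero.
    // Returns true iff this call freed the buffer.
    boolean release() {
        int remaining = refCnt.decrementAndGet();
        if (remaining < 0) {
            throw new IllegalStateException("released more times than retained");
        }
        if (remaining == 0) {
            freed = true; // real Netty would return the direct memory here
        }
        return freed;
    }

    boolean isFreed() { return freed; }
}
```

[The reported grpc-java #9340 bug amounts to a missing release() on the retry path (RetriableStream → MessageFramer), so the count never reaches zero and the direct memory accumulates until the OOM seen above.]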
