Hi William,

Thanks for confirming the error stack.  Hope the new gRPC release will
come out soon.
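
For anyone following the thread, the detector's warning is about Netty's
reference-counting contract: every allocated ByteBuf must be release()'d
exactly once, or the backing direct memory is never returned.  A minimal
sketch of that contract (the tiny Buf class below is hypothetical, standing
in for the real ByteBuf/release() API, just to show what the leak detector
checks):

```java
// Illustrative sketch only; Netty's real ByteBuf has the same
// refCnt()/release() shape but is backed by actual direct memory.
public class RefCountSketch {
    static final class Buf {
        private int refCnt = 1;            // allocation starts with refCnt = 1

        int refCnt() { return refCnt; }

        boolean release() {                // must be called exactly once
            refCnt--;
            return refCnt == 0;            // true => memory is reclaimed
        }
    }

    public static void main(String[] args) {
        Buf buf = new Buf();
        try {
            // ... write to / read from the buffer ...
        } finally {
            // Skipping this release is exactly what ResourceLeakDetector
            // reports once the unreleased object is garbage-collected.
            boolean freed = buf.release();
            System.out.println("freed=" + freed + " refCnt=" + buf.refCnt());
        }
    }
}
```

The retry-path bug in grpc-java #9340 is a case where this release is
skipped on a retried stream, so the fix has to come from the library side.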

Tsz-Wo


On Sun, Jul 24, 2022 at 6:12 AM William Song <[email protected]> wrote:

> Hi Tsz-Wo,
>
> After turning on the paranoid leak detection level, we reproduced the
> direct OOM again. This time we found that the error stack is quite similar
> to the one in the recent grpc-java issue #9340 (
> https://gist.github.com/davidnadeau/4da26f072482fb58b19f4bd0379f35c7 ).
> Both share the same call path from RetriableStream to MessageFramer.
>
> Therefore, the direct OOM problem can be attributed to a bug in the gRPC
> retry mechanism. All we need to do is wait for the grpc-1.48.1 release and
> update the corresponding dependencies. Thanks again for the time you spent
> on this!
>
> Regards,
> William
>
>
> > > On Jul 21, 2022, at 13:13, Tsz Wo Sze <[email protected]> wrote:
> >
> > Hi William,
> >
> > According to this comment
> > https://github.com/grpc/grpc-java/issues/9340#issuecomment-1185995690 ,
> > they will have a fix in 1.48.1 soon.
> >
> > Tsz-Wo
> >
> >
> > On Wed, Jul 20, 2022 at 7:43 PM William Song <[email protected]> wrote:
> >
> >> Hi Tsz-Wo,
> >>
> >> It indeed looks like the same problem. I’ll add
> >> netty.leakDetectionLevel=paranoid to see if I can obtain more
> >> information.
> >>
> >> William
> >>
> >>> On Jul 21, 2022, at 01:13, Tsz Wo Sze <[email protected]> wrote:
> >>>
> >>> Hi William,
> >>>
> >>> Indeed, there is a recent gRPC "ByteBuffer memory leak in retry
> >>> mechanism" issue; see https://github.com/grpc/grpc-java/issues/9340 .
> >>> Not sure if it is the same problem you saw.
> >>>
> >>> Tsz-Wo
> >>>
> >>>
> >>> On Tue, Jul 19, 2022 at 6:13 PM Tsz Wo Sze <[email protected]> wrote:
> >>>
> >>>> Hi William,
> >>>>
> >>>>> ... We use gRPC as their underlying communication channel. ...
> >>>>
> >>>> I searched the source code of IoTDB.  IoTDB uses neither the Ratis
> >>>> Streaming API nor anything in org.apache.ratis.thirdparty.io.netty.
> >>>> Therefore, the leak seems to be from the gRPC library.
> >>>>
> >>>> Tsz-Wo
> >>>>
> >>>>
> >>>> On Tue, Jul 19, 2022 at 1:22 AM William Song <[email protected]> wrote:
> >>>>
> >>>>> Hi Tsz-Wo,
> >>>>>
> >>>>> We set up a cluster of IoTDB DataNodes, which constitute a Raft
> >>>>> group with 3 members, and have 3 clients writing data to these 3
> >>>>> servers respectively.  We use gRPC as their underlying communication
> >>>>> channel. After 48h of running, the 3 clients wrote about 100GB of
> >>>>> data. Worth noting, one server is particularly slow and is about
> >>>>> 2000 log entries behind. On this slow server we discovered the
> >>>>> direct memory OOM error. This happens occasionally and is not
> >>>>> deterministic.
> >>>>>
> >>>>> William
> >>>>>
> >>>>>
> >>>>>
> >>>>>> On Jul 19, 2022, at 00:51, Tsz Wo Sze <[email protected]> wrote:
> >>>>>>
> >>>>>> Hi William,
> >>>>>>
> >>>>>> It does look like a leak.  Could you provide the steps for
> >>>>>> reproducing it?
> >>>>>>
> >>>>>> Tsz-Wo
> >>>>>>
> >>>>>>
> >>>>>> On Mon, Jul 18, 2022 at 8:41 AM William Song <[email protected]> wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> We discovered an error log from
> >>>>>> org.apache.ratis.thirdparty.io.netty.util.ResourceLeakDetector
> >>>>>> saying ByteBuf.release() is not called before it’s
> >>>>>> garbage-collected. The following is the error log screenshot. We
> >>>>>> encountered direct memory OOM several times when running Ratis for
> >>>>>> a long time, so we assume this message may have something to do
> >>>>>> with the direct memory OOM problem.
> >>>>>>
> >>>>>> Could anyone please take a look and check whether there is a memory
> >>>>>> leak? Thanks in advance!
> >>>>>>
> >>>>>> Best Wishes,
> >>>>>> William
> >>>>>
> >>>>>
> >>
> >>
>
>
