Hi Tsz-Wo,

After turning on the paranoid leak detection level, we reproduced the direct
memory OOM again. This time we found that the error stack is quite similar to
the one in the recent grpc-java issue #9340
(https://gist.github.com/davidnadeau/4da26f072482fb58b19f4bd0379f35c7).
Both share the same call path from RetriableStream to MessageFramer.
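
For reference, here is a minimal sketch of how the paranoid level can be
turned on from Java code. It is an illustration only: the class name
LeakDetectionConfig is hypothetical, and the second property key assumes that
the Netty copy relocated under org.apache.ratis.thirdparty reads its settings
under the relocated prefix, which should be checked against the Ratis version
in use.

```java
public class LeakDetectionConfig {
    // System property read by unshaded Netty's ResourceLeakDetector.
    static final String NETTY_KEY = "io.netty.leakDetection.level";

    // ASSUMPTION: key used by the Netty copy shaded into ratis-thirdparty.
    // Not verified here; check it against the Ratis version you run.
    static final String SHADED_KEY = "org.apache.ratis.thirdparty." + NETTY_KEY;

    // Sets both keys to "paranoid" so that every buffer is tracked.
    static void enableParanoid() {
        System.setProperty(NETTY_KEY, "paranoid");
        System.setProperty(SHADED_KEY, "paranoid");
    }

    public static void main(String[] args) {
        enableParanoid();
        System.out.println(NETTY_KEY + "=" + System.getProperty(NETTY_KEY));
        System.out.println(SHADED_KEY + "=" + System.getProperty(SHADED_KEY));
    }
}
```

The properties have to be set before Netty's ResourceLeakDetector class is
loaded, so passing them as -D options on the JVM command line is usually the
safer route.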

Therefore, the direct memory OOM can be attributed to a bug in the gRPC retry
mechanism. All we need to do is wait for the gRPC 1.48.1 release and update
the corresponding dependencies. Thanks again for the time you spent on this!

Regards,
William


> On Jul 21, 2022, at 13:13, Tsz Wo Sze <[email protected]> wrote:
> 
> Hi William,
> 
> According to this comment
> https://github.com/grpc/grpc-java/issues/9340#issuecomment-1185995690 ,
> they will have a fix in 1.48.1 soon.
> 
> Tsz-Wo
> 
> 
> On Wed, Jul 20, 2022 at 7:43 PM William Song <[email protected]> wrote:
> 
>> Hi Tsz-Wo,
>> 
>> It indeed looks like the same problem. I’ll add
>> netty.leakDetectionLevel=paranoid to see if I can obtain more information.
>> 
>> William
>> 
>>> On Jul 21, 2022, at 01:13, Tsz Wo Sze <[email protected]> wrote:
>>> 
>>> Hi William,
>>> 
>>> Indeed, there is a recent gRPC "ByteBuffer memory leak in retry mechanism"
>>> issue; see https://github.com/grpc/grpc-java/issues/9340 .  Not sure if it
>>> is the same problem you saw.
>>> 
>>> Tsz-Wo
>>> 
>>> 
>>> On Tue, Jul 19, 2022 at 6:13 PM Tsz Wo Sze <[email protected]> wrote:
>>> 
>>>> Hi William,
>>>> 
>>>>> ... We use gRPC as their underlying communication channel. ...
>>>> 
>>>> I searched the source code of IoTDB.  IoTDB uses neither the Ratis
>>>> Streaming API nor anything in org.apache.ratis.thirdparty.io.netty.
>>>> Therefore, the leak seems to be from the gRPC library.
>>>> 
>>>> Tsz-Wo
>>>> 
>>>> 
>>>> On Tue, Jul 19, 2022 at 1:22 AM William Song <[email protected]> wrote:
>>>> 
>>>>> Hi Tsz-Wo,
>>>>> 
>>>>> We set up a cluster of IoTDB Datanodes, which constitute a Raft group
>>>>> with 3 members, and have 3 clients writing data to these 3 servers
>>>>> respectively.  We use gRPC as their underlying communication channel.
>>>>> After 48h of running, the 3 clients wrote about 100GB of data. Notably,
>>>>> 1 server is particularly slow and is about 2000 logs behind. On this slow
>>>>> server we discovered the direct memory OOM error. This happens
>>>>> occasionally and is not deterministic.
>>>>> 
>>>>> William
>>>>> 
>>>>> 
>>>>> 
>>>>>> On Jul 19, 2022, at 00:51, Tsz Wo Sze <[email protected]> wrote:
>>>>>> 
>>>>>> Hi William,
>>>>>> 
>>>>>> It does look like a leak.  Could you provide the steps for reproducing
>>>>>> it?
>>>>>> 
>>>>>> Tsz-Wo
>>>>>> 
>>>>>> 
>>>>>> On Mon, Jul 18, 2022 at 8:41 AM William Song <[email protected]> wrote:
>>>>>> Hi,
>>>>>> 
>>>>>> We discovered an error log from
>>>>>> org.apache.ratis.thirdparty.io.netty.util.ResourceLeakDetector saying
>>>>>> ByteBuf.release() was not called before it was garbage-collected. The
>>>>>> following is the error log screenshot. We encountered direct memory OOM
>>>>>> several times when running Ratis for a long time, so we assume this
>>>>>> message may have something to do with the direct memory OOM problem.
>>>>>> 
>>>>>> Could anyone please take a look and check whether there is a memory
>>>>>> leak? Thanks in advance!
>>>>>> 
>>>>>> Best Wishes,
>>>>>> William
>>>>> 
>>>>> 
>> 
>> 
