Answer inline.

Zhang, James <[email protected]> 于2020年1月9日周四 下午1:35写道:

> Hi Wu Sheng,
> The storage is an ES cluster with 3 nodes in K8S pods (CPU: 2 cores, Memory:
> 3G) and a 100GB SSD disk.
>

The resources are not enough, especially the CPU. Provide at least 6-8 cores
and 8-10G of memory per node, and please add a monitoring plugin for ES too.
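
In K8S terms that would be roughly the following per ES container (a sketch
only; the exact numbers are just the suggestion above, tune them to your own
payload):

  # Sketch: per-node ES container resources following the 6-8 core / 8-10G
  # suggestion; not a verified recommendation.
  resources:
    requests:
      cpu: "6"
      memory: 8Gi
    limits:
      cpu: "8"
      memory: 10Gi
  # If you raise the pod memory, raise the ES JVM heap too (commonly about half
  # of the container memory), e.g. via the ES_JAVA_OPTS environment variable.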


>
> I also found a lot "GRPCRemoteClient DEADLINE_EXCEEDED" error log both in
> OAP and Java agent logs.
> 2020-01-08 05:38:41,512 -
> org.apache.skywalking.oap.server.core.remote.client.GRPCRemoteClient
> -446965 [grpc-default-executor-2] ERROR [] - DEADLINE_EXCEEDED: deadline
> exceeded after 19999971927ns
> io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after
> 19999971927ns
>

This means the communication between OAP nodes is blocked too, mostly due
to the storage performance.
Please consider enabling SkyWalking's Prometheus self-monitoring (telemetry).
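
In 6.x that is a small change in the OAP application.yml, roughly like below
(from my memory of the defaults, please double-check the telemetry doc of your
version):

  telemetry:
    # Replace the default none: provider with prometheus: so the OAP exposes
    # its self-observability metrics for Prometheus to scrape.
    prometheus:
      host: 0.0.0.0
      port: 1234   # default metrics port, if I remember correctly

With that enabled you can see the OAP's own load instead of guessing.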



>
> Does that mean the OAP service is not powerful enough to receive trace
> data from the massive number of Java agent clients?
>

From my understanding, both OAP and ES do not have enough resources.



>
> What's the recommended K8S resource configuration for OAP/ES/UI & the Java
> service agent (buffer) configuration for a PRODUCTION env?
>

There is no specific requirement; most of it depends on your payload. For
reference, I know of a user handling 10 billion traces per day with 25+ OAP
and ES nodes.
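
For scale, 10 billion traces per day is roughly 10,000,000,000 / 86,400 ≈
115,000 traces per second on average, before peaks, so you can compare that
order of magnitude against your own load when sizing.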



>
> P.S.
> Java services and SkyWalking are deployed into the same K8S cluster, so
> there should be no network bottleneck for trace data traffic.
>

Network would not be the first concern.

Note that SkyWalking is a complex APM system; it is not like Zipkin or
Prometheus, so you should expect it to use much more resources than those two.



>
> Thanks & Best Regards
>
> Xiaochao Zhang(James)
> DI SW CAS MP EMK DO-CHN
> No.7, Xixin Avenue, Chengdu High-Tech Zone
> Chengdu, China  611731
> Cellphone: +86 13980787820
> Email: [email protected]
>
> -----Original Message-----
> From: Sheng Wu <[email protected]>
> Sent: Wednesday, January 8, 2020 5:35 PM
> To: dev <[email protected]>
> Subject: Re: Java agent buffer setting to avoid trace segment abandon
>
> What is the storage?
>
> And clearly, your backend is not powerful enough. 2 cores is just for a quick
> startup. Since you don't give it enough resources, SkyWalking will work in
> protection mode and traces will be ignored.
>
> Sheng Wu 吴晟
> Twitter, wusheng1108
>
>
> Zhang, James <[email protected]> 于2020年1月8日周三 下午5:18写道:
>
> > Thanks for the quick response, Wu Sheng.
> > Env detail:
> > The Java services are deployed into K8S pods with 18 instances: CPU: 3
> > cores, Memory: 6G. The OAP services are deployed into K8S pods with 3
> > instances: CPU: 2 cores, Memory: 2G.
> >
> > The performance pressure test for Java services is about 6500 cps.
> >
> > I grepped a single POD's SkyWalking logs for "abandoned" and 700,000+
> > records were found.
> >
> > Thanks & Best Regards
> >
> > Xiaochao Zhang(James)
> > DI SW CAS MP EMK DO-CHN
> > No.7, Xixin Avenue, Chengdu High-Tech Zone Chengdu, China  611731
> > Cellphone: +86 13980787820
> > Email: [email protected]
> >
> > -----Original Message-----
> > From: Sheng Wu <[email protected]>
> > Sent: Wednesday, January 8, 2020 3:49 PM
> > To: dev <[email protected]>
> > Subject: Re: Java agent buffer setting to avoid trace segment abandon
> >
> > Hi Zhang
> >
> > Welcome to join the dev ml. How much payload do you put in the tests?
> > Are your backend and storage powerful enough?
> > I would prefer that you share more information about your test env.
> >
> > Sheng Wu 吴晟
> > Twitter, wusheng1108
> >
> >
> > Zhang, James <[email protected]> wrote on Wed, Jan 8, 2020 at 3:46 PM:
> >
> > > Dear Skywalking Dev,
> > > I found a lot of "trace segment has been abandoned, cause by buffer
> > > is full" logs in my Java services with the SkyWalking 6.6.0 agent enabled.
> > > DEBUG 2020-01-06 20:43:17:699 http-nio-0.0.0.0-9090-exec-154
> > > TraceSegmentServiceClient : One trace segment has been abandoned,
> > > cause by buffer is full.
> > >
> > > And some "xxx trace segments have been abandoned, cause by no
> > > available channel" logs were also found.
> > > 2020-01-06 21:37:53:716 DataCarrier.DEFAULT.Consumser.0.Thread
> > > TraceSegmentServiceClient : 237 trace segments have been abandoned,
> > > cause by no available channel.
> > >
> > > I checked the source code & documentation and found that the default
> > > buffer setting is 5 (channel_size) * 300 (buffer_size), and it seems that
> > > this default setting is not enough for a production environment with a
> > > heavy-load system.
> > >
> > > To avoid the trace segment abandons, is it OK to just increase the
> > > buffer setting (e.g. to 10 * 3000)? How can I estimate the memory (in
> > > MB) for the buffer setting so that I can evaluate the memory
> > > footprint of the segments buffer?
> > >
> > > Thanks & Best Regards
> > >
> > > Xiaochao Zhang(James)
> > > DI SW CAS MP EMK DO-CHN
> > > No.7, Xixin Avenue, Chengdu High-Tech Zone Chengdu, China  611731
> > > Email: [email protected]
> > > <mailto:[email protected]>
> > >
> > >
> >
>
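
P.S. About the buffer question quoted above: the two keys live in the Java
agent's agent.config; a rough sketch with the 10 * 3000 example values (a
bigger buffer only trades memory for fewer drops, it does not fix a slow
backend):

  # Sketch: agent-side segment buffer, using the 10 * 3000 example from the
  # question (the defaults are 5 and 300).
  buffer.channel_size=10
  buffer.buffer_size=3000
  # Rough memory estimate: channel_size * buffer_size = 30,000 segment slots at
  # most; multiply by your average in-memory segment size (application specific,
  # it depends on span count and tag sizes) to approximate the worst case.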
